Overview
SableDB
uses a Key / Value database for its underlying data storage. We chose to use RocksDB
as its mature, maintained and widely used in the industry by giant companies.
Because the RocksDB
is key-value storage and Redis data structures can be more complex, an additional
data encoding is required.
This chapter covers how SableDB
encodes the data for the various data types (e.g. String
, Hash
, Set
etc)
Note
Numbers are encoded using Big Endians to preserve lexicographic ordering
SableDB
takes advantage of the following RocksDB
traits:
RocksDB
keys are stored lexicographically (this is whySableDB
uses big-endiands)RocksDB
provides prefix iterators which allowsSableDB
to place iterator on the first item that matches a prefix
The String
data type
The most basic data type in SableDB
is the String
data type. String
s in SableDB
are always binary safe
Each String
record in the SableDB
consists of a single entry in RocksDB
:
A B C D E F G H
+-----+-----+-------+----------+ +-----+------------+----+-------+
| 1u8 | DB# | Slot# | user key | => | 0u8 | Expirtaion | ID | value |
+-----+-----+-------+----------+ +-----+------------+----+-------+
The key for a String
record is encoded as follows:
A
the first byte (u8
) is always set to1
- this indicates that this is a data entry (there are other type of keys in the database)B
the database ID is encoded asu16
(this implies thatSableDB
supports up to64K
databases)C
the slot numberD
the actual key value (e.g.set mykey myvalue
->mykey
is set here)
The value is encoded as follows:
E
the first byte is the type bit, value of0
means that the this record is of typeString
F
the record expiration infoG
unique ID (relevant for complex types likeHash
), forString
this is always0
H
the user value
Using the above encoding, we can now understand how SableDB
reads from the database. Lets have a look a the command:
get mykey
SableDB
encodes a key from the user key (mykey
) by prepending the following:
1
u8 - to indicate that this is the data record- The active database number (defaults to
0
) - The slot number
- The user string key (i.e.
mykey
)
This is the key that is passed to RocksDB
for reading
- If the key exists in the database:
- If the type (field E
) is != 0
- i.e. the entry is not a String
, SableDB
returns a -WRONGTYPE
error
- If value is expired -> SableDB
returns null
and deletes the record from the database
- Otherwise, SableDB
returns the H
part of the value (the actual user data)
- Else (no such key) return null
The List
data type
A List
is a composite data type. SableDB
stores the metadata of the list using a dedicated record and each list element
is stored in a separate entry.
List metadata:
A B C D
+-----+---- +--------+------------+
| 1u8 | DB# | Slot# | list name |
+-----+---- +--------+------------+
E F G H I J
+-----+------------+--------- +------+------+-------+
=> | 1u8 | Expirtaion | List UID | head | tail | size |
+-----+------------+--------- +------+------+-------+
List item:
K L M N O P
+-----+--------------+---------------+ +------+--------+------------+
| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |
+-----+--------------+---------------+ +------+--------+------------+
Unlike String
, a List
is using an additional entry in the database that holds the list metadata.
- Encoded items
A
->D
are the same asString
E
the first byte is always set to1
(unlikeString
which is set to0
)F
Expiration infoG
The list UID. Each list is assigned with a unique ID (an incremental number that never repeat itself, evern after restarts)H
the UID of the list head item (u64
)I
the UID of the list tail item (u64
)J
the list length
In addition to the list metadata (SableDB
keeps a single metadata item per list) we add a list item per new list item
using the following encoding:
K
the first bit which is always set to2
("List Item")L
the parent list ID (see fieldG
above)M
the item UIDN
the UID of the previous item in the list (0
means that this item is the head)O
the UID of the next item in the list (0
means that this item is the last item)P
the list value
The above encoding allows SableDB
to iterate over all list items by creating a RocksDB
iterator and move it to
the prefix [ 2 | <list-id>]
(2
indicates that only list items should be scanned, and list-id
makes sure that only
the requested list items are visited)
The Hash
data type
Hash items are encoded using the following:
Hash metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expirtaion | Set UID | size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
Hash item:
P Q R S
+-----+--------------+-------+ +-------+
| 3u8 | Hash ID(u64) | field | => | value |
+-----+--------------+-------+ +-------+
- Encoded items
A
->H
are basically identical to the hashA
->H
fields P
always set to3
("hash member")Q
the hash ID for which this member belongs toR
the hash fieldS
the field's value
The Sorted Set
data type
The sorted set ( Z*
commands) is encoded using the following:
Sorted set metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expirtaion | ZSet UID| size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
ZSet item 1 (Index: "Find by member"):
K L M O
+-----+--------------+---------+ +-------+
| 4u8 | ZSet ID(u64) | member | => | score |
+-----+--------------+---------+ +-------+
ZSet item 2 (Index: "Find by score"):
P Q R S T
+-----+--------------+-------+-------+ +------+
| 5u8 | ZSet ID(u64) | score |member | => | null |
+-----+--------------+-------+-------+ +------+
Sorted set requires double index (score & member), this is why each zset item member is kept using 2 records.
The zset metadata contains:
- Encoded items
A
->D
are the same asString
E
will always contains3
forsorted set
F
the expiration infoG
the unique zset IDH
the set size (number of members)
Each zset item are kept using 2 records:
Index: "Find by member"
The first record allows SableDB
to find a member score (the key is the member value)
K
the first bit which is always set to4
("ZSet member Item")L
the zset ID for which this item belongs toM
the zset memberO
this member score value
Index: "Find by score"
The second record, allows SableDB
to find member by score (we use the score as the key)
P
the first bit is always set to5
("Zset score item")Q
the zset ID for which this item belongs toR
the record's score valueS
the memberT
not used
The above encoding records provides all the indexing required by SableDB
to implement the sorted set commands.
For example, in order to implement the command ZCOUNT
(Returns the number of elements in the sorted set at key with a score between min and max):
SableDB
first loads the metadata using the zset key in order to obtain its unique ID- Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE]
(Index: "Find by score") - Start iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCORE
value
The Set
data type
Set items are encoded using the following:
Set metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expirtaion | Set UID | size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
Set item:
P Q R S
+-----+--------------+-------+ +------+
| 6u8 | Set ID(u64) | field | => | null |
+-----+--------------+-------+ +------+
- Encoded items
A
->H
are basically identical to the sorted setA
->H
fields P
always set to6
("set member")Q
the set ID for which this member belongs toR
the set fieldS
null (not used)
Bookkeeping records
Every composite item (Hash
, Sorted Set
, List
or Set
) created by SableDB
, also creates a record in the bookkeeping
"table".
A bookkeeping records keeps track of the composite item unique ID + its type (which is needed by the data eviction job)
The bookkeeping
record is encoded as follows:
Bookkeeping:
A B C D E
+-----+----+--------+-----------+ +----------+
| 0u8 | UID| DB# | UID type | => | user key |
+-----+----+--------+-----------+ +----------+
A
a bookkeeping records starts with0
B
au64
field containing the composite item UID (e.g.Hash UID
)C
the database ID for which the UID belongs toD
the UID type when it was created (e.g. "hash" or "set")E
the user key associated with the UID (e.g. the hash name)