Overview
SableDB uses a Key / Value database for its underlying data storage. We chose to use RocksDB
as its mature, maintained and widely used in the industry by giant companies.
Because the RocksDB is key-value storage and Redis data structures can be more complex, an additional
data encoding is required.
This chapter covers how SableDB encodes the data for the various data types (e.g. String, Hash, Set etc)
Note
Numbers are encoded using Big Endians to preserve lexicographic ordering
SableDB takes advantage of the following RocksDB traits:
RocksDBkeys are stored lexicographically (this is whySableDBuses big-endiands)RocksDBprovides prefix iterators which allowsSableDBto place iterator on the first item that matches a prefix
The String data type
The most basic data type in SableDB is the String data type. Strings in SableDB are always binary safe
Each String record in the SableDB consists of a single entry in RocksDB:
A B C D E F G H
+-----+-----+-------+----------+ +-----+------------+----+-------+
| 1u8 | DB# | Slot# | user key | => | 0u8 | Expirtaion | ID | value |
+-----+-----+-------+----------+ +-----+------------+----+-------+
The key for a String record is encoded as follows:
Athe first byte (u8) is always set to1- this indicates that this is a data entry (there are other type of keys in the database)Bthe database ID is encoded asu16(this implies thatSableDBsupports up to64Kdatabases)Cthe slot numberDthe actual key value (e.g.set mykey myvalue->mykeyis set here)
The value is encoded as follows:
Ethe first byte is the type bit, value of0means that the this record is of typeStringFthe record expiration infoGunique ID (relevant for complex types likeHash), forStringthis is always0Hthe user value
Using the above encoding, we can now understand how SableDB reads from the database. Lets have a look a the command:
get mykey
SableDB encodes a key from the user key (mykey) by prepending the following:
1u8 - to indicate that this is the data record- The active database number (defaults to
0) - The slot number
- The user string key (i.e.
mykey)
This is the key that is passed to RocksDB for reading
- If the key exists in the database:
- If the type (field E) is != 0 - i.e. the entry is not a String, SableDB returns a -WRONGTYPE error
- If value is expired -> SableDB returns null and deletes the record from the database
- Otherwise, SableDB returns the H part of the value (the actual user data)
- Else (no such key) return null
The List data type
A List is a composite data type. SableDB stores the metadata of the list using a dedicated record and each list element
is stored in a separate entry.
List metadata:
A B C D
+-----+---- +--------+------------+
| 1u8 | DB# | Slot# | list name |
+-----+---- +--------+------------+
E F G H I J
+-----+------------+--------- +------+------+-------+
=> | 1u8 | Expirtaion | List UID | head | tail | size |
+-----+------------+--------- +------+------+-------+
List item:
K L M N O P
+-----+--------------+---------------+ +------+--------+------------+
| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |
+-----+--------------+---------------+ +------+--------+------------+
Unlike String, a List is using an additional entry in the database that holds the list metadata.
- Encoded items
A->Dare the same asString Ethe first byte is always set to1(unlikeStringwhich is set to0)FExpiration infoGThe list UID. Each list is assigned with a unique ID (an incremental number that never repeat itself, evern after restarts)Hthe UID of the list head item (u64)Ithe UID of the list tail item (u64)Jthe list length
In addition to the list metadata (SableDB keeps a single metadata item per list) we add a list item per new list item
using the following encoding:
Kthe first bit which is always set to2("List Item")Lthe parent list ID (see fieldGabove)Mthe item UIDNthe UID of the previous item in the list (0means that this item is the head)Othe UID of the next item in the list (0means that this item is the last item)Pthe list value
The above encoding allows SableDB to iterate over all list items by creating a RocksDB iterator and move it to
the prefix [ 2 | <list-id>] (2 indicates that only list items should be scanned, and list-id makes sure that only
the requested list items are visited)
The Hash data type
Hash items are encoded using the following:
Hash metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expirtaion | Set UID | size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
Hash item:
P Q R S
+-----+--------------+-------+ +-------+
| 3u8 | Hash ID(u64) | field | => | value |
+-----+--------------+-------+ +-------+
- Encoded items
A->Hare basically identical to the hashA->Hfields Palways set to3("hash member")Qthe hash ID for which this member belongs toRthe hash fieldSthe field's value
The Sorted Set data type
The sorted set ( Z* commands) is encoded using the following:
Sorted set metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expirtaion | ZSet UID| size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
ZSet item 1 (Index: "Find by member"):
K L M O
+-----+--------------+---------+ +-------+
| 4u8 | ZSet ID(u64) | member | => | score |
+-----+--------------+---------+ +-------+
ZSet item 2 (Index: "Find by score"):
P Q R S T
+-----+--------------+-------+-------+ +------+
| 5u8 | ZSet ID(u64) | score |member | => | null |
+-----+--------------+-------+-------+ +------+
Sorted set requires double index (score & member), this is why each zset item member is kept using 2 records.
The zset metadata contains:
- Encoded items
A->Dare the same asString Ewill always contains3forsorted setFthe expiration infoGthe unique zset IDHthe set size (number of members)
Each zset item are kept using 2 records:
Index: "Find by member"
The first record allows SableDB to find a member score (the key is the member value)
Kthe first bit which is always set to4("ZSet member Item")Lthe zset ID for which this item belongs toMthe zset memberOthis member score value
Index: "Find by score"
The second record, allows SableDB to find member by score (we use the score as the key)
Pthe first bit is always set to5("Zset score item")Qthe zset ID for which this item belongs toRthe record's score valueSthe memberTnot used
The above encoding records provides all the indexing required by SableDB to implement the sorted set commands.
For example, in order to implement the command ZCOUNT (Returns the number of elements in the sorted set at key with a score between min and max):
SableDBfirst loads the metadata using the zset key in order to obtain its unique ID- Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE](Index: "Find by score") - Start iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCOREvalue
The Set data type
Set items are encoded using the following:
Set metadata:
A B C D E F G H
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expirtaion | Set UID | size |
+-----+---- +--------+-----------+ +-----+------------+---------+-------+
Set item:
P Q R S
+-----+--------------+-------+ +------+
| 6u8 | Set ID(u64) | field | => | null |
+-----+--------------+-------+ +------+
- Encoded items
A->Hare basically identical to the sorted setA->Hfields Palways set to6("set member")Qthe set ID for which this member belongs toRthe set fieldSnull (not used)
Bookkeeping records
Every composite item (Hash, Sorted Set, List or Set) created by SableDB, also creates a record in the bookkeeping "table".
A bookkeeping records keeps track of the composite item unique ID + its type (which is needed by the data eviction job)
The bookkeeping record is encoded as follows:
Bookkeeping:
A B C D E
+-----+----+--------+-----------+ +----------+
| 0u8 | UID| DB# | UID type | => | user key |
+-----+----+--------+-----------+ +----------+
Aa bookkeeping records starts with0Bau64field containing the composite item UID (e.g.Hash UID)Cthe database ID for which the UID belongs toDthe UID type when it was created (e.g. "hash" or "set")Ethe user key associated with the UID (e.g. the hash name)