"Black Horse Redis Principles" 1. Data Structure

Redis Low-Level C Language

Data Structures#

Dynamic String SDS#

Redis does not use C language's String, as it is actually a character array.

Example:
char* s="hello" is {'h', 'e', 'l', 'l', 'o', '\0'}

Thus, Redis built a new string structure called Simple Dynamic String, abbreviated as SDS.

image.png|500

Here, reading does not depend on '\0' just to conform to C language; the actual usage is len.

image.png|500

When less than 1M, memory allocation increases quickly; after that, allocation is done in intervals of 1M, noting that memory allocation is very costly.

Binary safety means reading based on len, and it will not suddenly stop due to the appearance of '\0' in the middle.

IntSet#

IntSet is an implementation of a set collection in Redis, based on an integer array, with variable length and ordered characteristics.

image.png|500

|500

Encoding determines the size occupied by each element; here, 16 bits means 2 bytes, and length determines the number of elements stored in the array.

If the encoding is 16 but an element exceeding the encoding is added, elements are copied in reverse order to the correct position (reverse to avoid overwriting), then the new element is added, and finally, the encoding and length in the Header are modified.

image.png|500

IntSet can be seen as a special integer array with some characteristics:

Redis ensures that elements in IntSet are unique and ordered.
It has a type upgrade mechanism to save memory space.
It uses binary search for querying at the lower level.

Dict#

Dict key-value pairs, the size of the hash table is always $2^n$, and used may exceed size; at this point, a bucket will degrade into a linked list.

Union refers to any one of them.

image.png|500

sizemask = size - 1, binary is n ones.

image.png|500

Actual composition.

image.png|500

Expansion

The HashTable in Dict is a combination of arrays and linked lists, so when considering too much data to resolve data conflicts, it expands to the next power of $2^n$.

image.png|500

Shrinkage

Every time an element is deleted, it checks if loadFactor < 0.1, and if so, it shrinks to the next power of $2^n$, with a minimum of 4.

image.png|500

Rebuilding

Whether expanding or shrinking involves rebuilding, changing size and sizemask, so rehashing is necessary. This is why dict has ht[2], one in use and one empty.

image.png|500

Example of expansion and rebuilding.

image.png|500

|500

When the data volume is very large, it cannot be rebuilt all at once, so it is actually a progressive hash, slowly migrating. Previous data may be in ht[0]/ht[1], so new additions go into ht[1], while ht[0] only decreases.

image.png|500

ZipList#

ZipList is a special "double-ended linked list" composed of a series of specially encoded contiguous memory blocks. Operations can be pushed/popped from either end, with a time complexity of O(1).

image.png|500

Size.

image.png|500

You can obtain the starting address of the previous entry for reverse traversal by the length occupied by the previous entry, and the starting address of the next entry for forward traversal by calculating the length in the structure.

image.png|500

Note that Redis uses little-endian byte order here.
String type.

image.png|500

Integer types from 0-12 can omit content and directly save values in the last 4 bits of encoding.

image.png|500

ZipList's Chain Update Problem#

The main issue is that the prevlen field, which records the length of the previous entry, will change from 1 byte to 5 bytes when the length is greater than or equal to 254 bytes, leading to potential chain updates for the next entry's prevlen.

Inserting or modifying an entry1 in ZipList to be greater than or equal to 254 bytes means that the prevlen field in the following entry2 needs to change from 1 byte to 5 bytes to record the new length of entry1.

This change causes entry3 after entry2 to also need to modify prevlen3 to record the new length after entry2 changes. In extreme cases, if every entry's prevlen field changes from 1 byte to 5 bytes when they are all greater than or equal to 254 bytes, it can lead to a series of chain updates, causing insertion O(1) to degrade to O(n).

image.png|500

Why is it greater than or equal to 254 and not 255 bytes when prevlen uses 5 bytes?
Prevlen has 8 bits, which can represent 0-255, but 255, i.e., 0xff, is occupied by the special marker zlend, which marks the end of the compressed list ZipList, while prevlen happens to be the start of an entry. Using 255 as a boundary would conflict with the zlend end marker.

ZipList Characteristics

The compressed list can be seen as a "doubly linked list" of contiguous memory space.
The nodes in the list are not connected by pointers but by recording the lengths of the previous and current nodes for addressing, resulting in lower memory usage.
If the list data is too much, causing the linked list to become too long, it may affect query performance.
Adding or deleting large data may cause continuous update issues.

QuickList#

Problems with ZipList:

Saving front and back linked list pointers requires 16 bytes, while ZipList saves a lot of memory but requires contiguous memory space, making memory allocation efficiency relatively low; it limits ZipList length and entry size.
To store a large amount of data exceeding ZipList's optimal storage limit; multiple ZipList shards are created for storage.
After data splitting, it becomes too scattered, making management and searching inconvenient; QuickList introduces a doubly linked list, with nodes recording ZipList.

image.png|500

Configuration item list-max-ziplist-size=-2 defaults to a ZipList memory usage of no more than 8kb.

image.png|500

Configuration item for whether to compress nodes, i.e., the number of uncompressed nodes at both ends, explained as frequently using the head and tail nodes, while the middle nodes are not often used.

image.png|500

Here, one node at each end is not compressed, so the two middle ZipLists are re-encoded.

image.png|500

QuickList Characteristics:

It is a doubly linked list with nodes as ZipList; it has more memory fragmentation and less contiguous occupation.
Nodes use ZipList, solving the memory usage problem of traditional linked lists; ZipList saves memory.
It controls the size of ZipList, solving the efficiency problem of contiguous memory space allocation; it shards ZipList, avoiding the need for large contiguous memory.
Middle nodes can be compressed, further saving memory.

SkipList#

SkipList is primarily a linked list.

Elements are arranged in ascending order.
Nodes may contain multiple pointers with different spans; this is at least faster than traversing one by one, allowing up to 32.

image.png|500

Backward pointers point back, while forward pointers use the level[] array, with at least one span pointer, allowing for quick forward jumps and then step-by-step exploration to the correct node.

image.png|500

During searching, it retrieves from higher levels downwards, similar to binary search, trading space for time, storing nodes after certain spans in advance.

image.png|500

SkipList Characteristics:

SkipList is a doubly linked list, with each node containing a score and an element.
Nodes are sorted by score value; if scores are the same, they are sorted by element dictionary order.
Each node can contain multiple layers of pointers, with the number of layers being a random number between 1 and 32.
Different layer pointers have different spans to the next node; the higher the layer, the larger the span.
The efficiency of adding, deleting, modifying, and querying is basically consistent with red-black trees, but the implementation is simpler.

RedisObject#

Any data type's key and value in Redis are encapsulated as RedisObject, also known as Redis Object.

image.png|500

Type, encoding, and lru together occupy 32 bits, 4 bytes; refcount 4 bytes; *ptr 8 bytes, making the RedisObject header occupy 16 bytes.

For example, for String type, one key-value consumes one RedisObject header, which indeed takes up space.

Encoding method encoding.

image.png|500

Five Data Types#

image.png|500

What about BitMap and HyperLogLog?
These derived data structures are actually implemented based on the String data structure, not new types.
Also, GEO is based on SortedSet.

String#

String is the most common data storage type in Redis.

String RAW Encoding#

The basic encoding method for String is RAW, implemented based on Simple Dynamic String SDS, with a storage limit of 512mb.

image.png|500

String EmbStr Encoding#

If the stored SDS length is less than or equal to 44 bytes, EMBSTR encoding is used; at this point, the RedisObject Head and SDS are in a contiguous space; only one memory allocation function call is needed, making it more efficient; this is related to Redis's memory allocation algorithm; in short, if the SDS length is 44 bytes, Head + SDS is exactly 64 bytes, avoiding memory fragmentation.

image.png|500

String INT Encoding#

If the stored string is an integer value and its size is within the range of LONG_MAX, INT encoding is used, directly saving the data at the RedisObject's ptr pointer location, which is exactly 8 bytes, eliminating the need for SDS.

String raw, embstr, int encoding#

image.png|500

List#

The List data type can operate on elements at both ends, using a QuickList data structure that combines LinkedList + ZipList.

image.png|500

List Structure.

image.png|500

Set#

The Set data type is a single-column collection that does not guarantee order but ensures element uniqueness, allowing for intersection, union, and difference; for querying, SISMEMBER is very frequent, so HashTable is used.

image.png|500

Using HT encoding Dict, elements are stored as keys, with values uniformly set to null.
When all stored elements are integers and the number of elements does not exceed set-max-intset-entries, Set uses IntSet encoding to save memory.

image.png|500

If using IntSet exceeds the maximum number or includes non-integers, it converts to HT encoding Dict.

image.png|500

IntSet encoded Set.

image.png|500

HashTable encoded Set.

image.png|500

ZSET#

ZSet, or SortedSet, has each element with a score value and member value, which can be sorted by score, ensuring member uniqueness and allowing for score queries; the relationship between member and score is a key-value pair, meaning ZSet requires key-value storage, unique keys, and sortable, so it can use either SkipList or HashTable.

SkipList sorts by score, satisfying key-value storage and sortable requirements, but when querying score by member, it degrades to a linked list, and member uniqueness is also inconvenient.

HashTable sets key-value as member-score, but it cannot satisfy the sorting function by score.

Thus, both data structures are used: Dict ensures member uniqueness, while ZSet ensures sortable and range queries.

image.png|500

ZSet uses a combination of Dict and SkipList, with encoding set to SKIPLIST.

image.png|500

However, when there are not many elements, using Dict + ZipList can occupy a lot of memory and is not very obvious, so ZSet will use ZipList structure to save memory when the following conditions are met:

The number of elements is less than zset_max_ziplist_entries, with a default value of 128.
Each element is less than zset_max_ziplist_value bytes, with a default value of 64.

When initializing, if both initial values meet these two conditions, then ZipList is used.

image.png|500

When adding elements, if it is ZipList, it will check whether to convert to SkipList + HashTable structure.

image.png|500

However, ZipList does not meet ZSet's three requirements; why can it use ZipList for small data and small data volumes?
It can simulate business logic to support sorting, key-value pair sorting, and continuous memory space, with element/member in front and value/score behind.

image.png|500

Hash#

The Hash structure is similar to the ZSet structure, requiring key-value storage, unique keys, and allowing values to be obtained by keys; it must be a hashTable.

Differences from ZSet:

In ZSet, the key is member, and the value is score; in Hash, both key and value can be arbitrary values.
ZSet requires sorting by score; Hash does not require sorting.

The Hash structure defaults to using ZipList encoding to save memory, with adjacent entries in ZipList storing field and value respectively.
When the data volume is large, the Hash structure will convert to HT encoding, i.e., Dict, triggered by two conditions:

The number of elements in the zipList exceeds hash-max-ziplist-entries, defaulting to 512.
Any entry size in the ZipList exceeds hash-max-ziplist-value, defaulting to 64 bytes.

image.png|500

Breaking any one of these conditions will convert to HT encoding.

image.png|500

Hash source code shows that it is initialized as ZipList by default.

image.png|500

Then, when trying to add, it checks whether to convert to HT encoding based on element size.

image.png|500

Finally, during addition, it checks for equality; if it is ZipList, it checks the length after insertion, and if it exceeds, it converts to HT encoding.

image.png|500

This article is synchronized and updated to xLog by Mix Space. The original link is https://blog.0xling.cyou/posts/redis/redis-3