Redis存储结构探究

一、概述

主流的key-value存储系统，都是在系统内部维护一个hash表，因为对hash表的操作时间复杂度为O(1)。如果数据增加以后，导致冲突严重，时间复杂度增加，则可以对hash表进行rehash，以来保证操作的常量时间复杂度。

那么，对于这样一个基于hash表的key-value存储系统，提出以下2个问题：

1）如何提供这么丰富的数据结构呢？

2）数据结构在内存中如何存储？

本文，我们将讲解、演示redis的内存布局和数据结构。

二、hash算法

思考题：如果我们有一个k-v系统，并把该系统设计成链表，比如：

现在我们需要在这个链表中查找“biying”的url，那么就需要遍历整个链表，然后与每个链表中的key进行比对，这样做的效率比较低，是一个O（n）的复杂度。假设我们有10W个key，链表的存储方式是不适用的。这个时候应该怎么办呢？

hash

hash即给定一个字符串或者其他的任意值x，通过hash函数得到一个散列值hash（x），比如给值“biying”，使用MurmurHash64B得到的hash值就是：7740020151228311510。

hash表可以简单理解为一个数组：

Redis存储结构探究

上述hash表设计引出以下问题：

1）如果通过索引（hash值）去读取hash表，这样设计的hash表会非常大，占用较大内存。

2）如果是通过hash值遍历hash表，如果k-v数量很多，则查找性能会是O(N)，则时间性能也很低。

另外，hash会有碰撞，即使再好的算法也会有碰撞，只是碰撞率低而已。redis是如何解决这些问题的呢？

三、redis字典

3.1 redis的hash表

介绍hash字典前，先了解redis的hash表：

/* This is our hash table structure. Every dictionary has two of this as we
* implement incremental rehashing, for the old to the new table.*/
typedef struct dictht {
	dictEntry **table;
	unsigned long size;
	unsigned long sizemask;
	unsigned long used;
} dictht;

table 属性是一个数组，数组中的每个元素都是一个指向 dict.h/dictEntry 结构的指针，每个dictEntry 结构保存着一个键值对；
size 属性记录了哈希表的大小，也即是 table 数组的大小，而 used 属性则记录了哈希表目前已有节点（键值对）的数量；
sizemask 属性的值总是等于 size - 1 ，这个属性和哈希值一起决定一个键应该被放到 table 数组的哪个索引上面；
used属性，表示hash表里已有的数量。

空hash表示意图：

Redis存储结构探究

3.2 hash表的节点

hash表节点使用dicEntry结构表示，每个dicEntry结构都保存一个键值对。

typedef struct dictEntry {
	// 键
	void *key;
	
	// 值
	union {
		void *val;
		uint64_t u64;
		int64_t s64;
		double d;
	} v;
	
	// 指向下个哈希表节点，形成链表
	struct dictEntry *next;
} dictEntry;

key：保存键值对的键
v：保存键值对的值，键值对的值可以是一个指针、整数或double类型。
next：指向另一个hash节点的指针，可以将多个哈希值相同的键值对连接在一起，用来解决冲突问题。

下图展示如何通过next指针，将两个索引相同的键k1和k0连接在一起。

Redis存储结构探究

3.3 redis字典

定义：

typedef struct dict {
	// 字典类型
	dictType *type;
	
	// 字典的私有数据
	void *privdata;
	
	// 哈希表，二维的，默认使用ht[0],当需要进行rehash的时候，会用ht[1]进行
	dictht ht[2];
	
	// rehash的索引，当没有进行rehash时其值为-1
	long rehashidx; /* rehashing not in progress if rehashidx ==-1 */
	
	// hash表的迭代器，一般用于rehash和主从复制等等
	 unsigned long iterators; /* number of iterators currently running */
} dict;

type 属性和 privdata 属性是针对不同类型的键值对，为创建多态字典设置的：

type 属性是指向 dictType 结构的指针，每个 dictType 结构保存了一簇用于操作特定类型键值对的函数， Redis 会为用途不同的字典设置不同的类型特定函数。
privdata 属性则保存了需要传给那些类型特定函数的可选参数。

typedef struct dictType {
	uint64_t (*hashFunction)(const void *key);
	void *(*keyDup)(void *privdata, const void *key);
	void *(*valDup)(void *privdata, const void *obj);
	int (*keyCompare)(void *privdata, const void *key1, const void*key2);
	void (*keyDestructor)(void *privdata, void *key);
	void (*valDestructor)(void *privdata, void *obj);
} dictType;

ht 属性是一个包含两个项的数组，数组中的每个项都是一个 dictht 哈希表，一般情况下，字典只使用ht[0] 哈希表， ht[1] 哈希表只会在对 ht[0] 哈希表进行rehash 时使用。

除了 ht[1] 之外，另一个和 rehash 有关的属性就是 rehashidx ：它记录了 rehash 当前的进度，如果目前没有在进行rehash ，那么它的值为 -1 。如下图展示了一个普通的字典：

Redis存储结构探究

redis是如何解决时间效率和空间效率的？

初始时，字典的hash表的大小只有4（sizemask为3,），则通过hash函数计算出的hash值可能会很大，此时hash值会与上（&）sizemask，得到存储在hash表里的table[index]，序号的，见如下源码：

# 使用字典设置的哈希函数，计算键 key 的哈希值
hash = dict->type->hashFunction(key);

# 使⽤哈希表的 sizemask 属性和哈希值，计算出索引值
# 根据情况不同， ht[x] 可以是 ht[0] 或者 ht[1]
index = hash & dict->ht[x].sizemask;

以前文提到的4个URL为例，它们是如何存储到一个空的hash表呢？

Redis存储结构探究

插入到字典后，字典结构如下：

Redis存储结构探究

总结：

hash表是随着K-V数量的增大而逐步增大的，并不直接以key的hash值为下标去取值得，而是以hash & sizemask去获取hash表的对应节点的；hash表的节点实际上是⼀个链表，如果hash &sizemask有冲突，则也把冲突key放在hash表的链表上，取值得时候还得遍历hash表里的链表。

四、redis数据库的结构

在redis的内部，redisServer结构体的全局变量server，server保存了redis服务端所有的信息，包括当前进程的PID、服务器的端口号、数据库个数、统计信息等等。当然，它也包含了数据库信息，包括数据库的个数、以及一个redisDb数组。

struct redisServer {
	 ...
	redisDb *db;
	int dbnum; /* Total number of configured DBs */
	...

dbnum就是redisDb数组的长度，每个数据库，都对应于一个redisDb，在redis的客户端中，可以通过select N来选择使用哪个数据库，各个数据库之间互相独立。例如：可以在不同的数据库中同时存在名为”0voice”的key。

从上面的分析中可以看到，server是全局变量，它包含了若干个redisDb，每个redisDb是一keyspace，各个keyspace互相独立，互不干扰。

redisDb的定义：

/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
	dict *dict; /* The keyspace for this DB */
	dict *expires; /* Timeout of keys with a timeoutset */
	dict *blocking_keys; /* Keys with clients waiting fordata (BLPOP)*/
	dict *ready_keys; /* Blocked keys that received a PUSH */
	dict *watched_keys; /* WATCHED keys for MULTI/EXEC CAS */
	int id; /* Database ID */
	long long avg_ttl; /* Average TTL, just for stats */
	list *defrag_later; /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

redis的每个数据库是独立的keyspace，因此，我们理所当然的认为，redis的数据库是一个hash表。

但是，从redisDb的定义来看，它并不是hash表，而是一个包含了很多hash表的结构。

之所以这样做，是因为redis还需要提供除了set、get以外更加丰富的功能(例如：键的超时机制)。

五、rehash

随着操作的不断执行，哈希表保存的键值对会逐渐地增多或者减少，为了让哈希表的负载因子（ratio）维持在合理的范围之内，当哈希表保存的键值对数量太多或者太少时，程序需要对哈希表的大小进行相应的扩展或者收缩。

ratio = ht[0].used / ht[0].size

比如，hash表的size为4，如果已经插入了4个k-v的话，则ratio 为 1

扩展和收缩哈希表的工作可以通过执行rehash （重新散列）操作来完成， Redis 对字典的哈希表执行rehash 的策略如下：

1）如果ratio小于0.1，则会对hash表进行收缩操作

/* If the percentage of used slots in the HT reaches HASHTABLE_MIN_FILL
 * we resize the hash table to save memory */

void tryResizeHashTables(int dbid) {
	if (htNeedsResize(server.db[dbid].dict))
		dictResize(server.db[dbid].dict);
	if (htNeedsResize(server.db[dbid].expires))
		dictResize(server.db[dbid].expires);
}

int htNeedsResize(dict *dict) {
	long long size, used;
	size = dictSlots(dict);
	 used = dictSize(dict);
	return (size > DICT_HT_INITIAL_SIZE && (used*100/size < HASHTABLE_MIN_FILL));
}

2）服务器目前没有在执行 BGSAVE 命令或者 BGREWRITEAOF 命令，并且哈希表的负载因子大于等于 1 ，则扩容hash表，扩大小为当前ht[0].used*2。

3）服务器⽬前正在执⾏行BGSAVE 命令或者 BGREWRITEAOF 命令，并且哈希表的负载因子大于等于 5，则扩容hash表，并且扩容大小为当前ht[0].used*2。

/* Expand the hash table if needed */
static int _dictExpandIfNeeded(dict *d)
{
	 /* Incremental rehashing already in progress. Return. */
	if (dictIsRehashing(d)) return DICT_OK;
	
	/* If the hash table is empty expand it to the initial size.
	*/
	 if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);
	 
	/* If we reached the 1:1 ratio, and we are allowed to resizethe hash
	* table (global setting) or we should avoid it but the ratiobetween
	* elements/buckets is over the "safe" threshold, we resize doubling
	* the number of buckets. */
	if (d->ht[0].used >= d->ht[0].size &&
	(dict_can_resize ||d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
	{
		return dictExpand(d, d->ht[0].used*2);
	}
	return DICT_OK;
}

其实上文说的扩容为ht[0].uesd*2 是不严谨的，实际上是刚好大于该书的2的N次方：

unsigned long realsize = _dictNextPower(size);

/* Our hash table capability is a power of two */
static unsigned long _dictNextPower(unsigned long size)
{
	unsigned long i = DICT_HT_INITIAL_SIZE;
	if (size >= LONG_MAX) return LONG_MAX + 1LU;
	
	while(1) {
		if (i >= size)
			return i;
		i *= 2;
	}
}

扩容的步骤如下：

为字典ht[1]哈希表分配合适的空间；
将ht[0]中所有的键值对rehash到ht[1]：rehash 指的是重新计算键的哈希值和索引值，然后将键值对放置到 ht[1] 哈希表的指定位置上；
当 ht[0] 包含的所有键值对都迁移到了 ht[1] 之后（ht[0] 变为空表），释放 ht[0] ，将 ht[1] 设置为 ht[0] ，并在 ht[1] 新创建一个空白哈希表，为下一次 rehash 做准备。。

六、渐进式rehash

上节说过，扩展或收缩哈希表需要将 ht[0] 的所有键值对 rehash 到 ht[1] ，但是，这个rehash 动作并不一次性、集中式地完成的，而是分多次、渐进式地完成的。

这样做的原因在于，如果 ht[0] 只保存着四个键值对，那么服务器可以在瞬间就将这些键值对全部rehash 到 ht[1] ；但是，如果哈希表保存的键值对数量不是四个，而是四百万、四千万甚至四亿个键值对，那么要一次性将这些键值对全部 rehash 到 ht[1] 的话，庞大的计算量可能会导致服务器在一段时间内停止服务。

因此，为了避免 rehash 对服务器性能造成影响，服务器不是一次性将 ht[0] 的所有键值对全部rehash 到 ht[1] ，而是分多次、渐进式地将 ht[0] 的键值对慢慢地 rehash 到 ht[1] 。

以下是哈希表渐进式 rehash 的详细步骤：

为 ht[1] 分配空间，让字典同时持有 ht[0] 和 ht[1] 两个哈希表。
在字典中维持一个索引计数器变量 rehashidx ，并将它的值设置为 0 ，表示 rehash 工作正式开始。
在 rehash 进行期间，每次对字典执行添加、删除、查找或者更新操作时，程序除了执⾏指定的操作以外，还会顺带将 ht[0] 哈希表在 rehashidx 索引上的所有键值对 rehash 到 ht[1] ，当 rehash 工作完成之后，程序将 rehashidx 属性的值增1。
随着字典操作的不断执行，最终在某个时间点上， ht[0] 的所有键值对都会被 rehash 至 ht[1] ，这时程序将 rehashidx 属性的值设为 -1 ，表示 rehash 操作已完成。

渐进式 rehash 的好处在于它采取分而治之的方式，将 rehash 键值对所需的计算工作均滩到对字典的每个添加、删除、查找和更新操作上，甚至是后台启一个定时器，每次时间循环时只工作作1毫秒，从而避免了集中式 rehash 而带来的庞大计算量。

/* This function handles 'background' operations we are requiredto do
* incrementally in Redis databases, such as active key expiring,resizing,
* rehashing. */
void databasesCron(void) {
	 ...
	
	/* Perform hash tables rehashing if needed, but only if thereare no
	* other processes saving the DB on disk. Otherwise rehashingis bad
	 * as will cause a lot of copy-on-write of memory pages. */
	 if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
	{
		 /* Resize */
		for (j = 0; j < dbs_per_call; j++) {
			tryResizeHashTables(resize_db % server.dbnum);
			resize_db++;
		}
		/* Rehash */
		if (server.activerehashing) {
			 for (j = 0; j < dbs_per_call; j++) {
				int work_done = incrementallyRehash(rehash_db);
				if (work_done) {
					 /* If the function did some work, stop here,we'll do
					* more at the next cron loop. */
					break;
				 } else {
					/* If this db didn't need rehash, we'll try the next one. */
					rehash_db++;
					rehash_db %= server.dbnum;
				 }
			}
	 	}
	}
}

/* Our hash table implementation performs rehashing incrementallywhile
* we write/read from the hash table. Still if the server is idle, the hash
 * table will use two tables for a long time. So we try to use 1millisecond
* of CPU time at every call of this function to perform some rehahsing.
*
* The function returns 1 if some rehashing was performed, otherwise 0
* is returned. */
 int incrementallyRehash(int dbid) {
	/* Keys dictionary */
	if (dictIsRehashing(server.db[dbid].dict)) {
	 	dictRehashMilliseconds(server.db[dbid].dict,1);
		return 1; /* already used our millisecond for this loop... */
	 }
	 /* Expires */
	if (dictIsRehashing(server.db[dbid].expires)) {
		dictRehashMilliseconds(server.db[dbid].expires,1);
	 	return 1; /* already used our millisecond for this loop... */
	}
	 return 0;
}

七、redisObject

hash表里的一个个对象都是什么呢？

typedef struct redisObject {
	unsigned type:4;
	unsigned encoding:4;
	unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
	* LFU data (least significant 8 bits frequency
	 * and most significant 16 bits accesstime). */
	 int refcount;
	 void *ptr;
} robj;

redis的对象有11种，之多，他们全都继承于redisObject。

/* Objects encoding. Some kind of objects like Strings and Hashes can be
* internally represented in multiple ways. The 'encoding' fieldof the object
* is set to one of this fields for this object. */
#define OBJ_ENCODING_RAW 0 /* Raw representation */
#define OBJ_ENCODING_INT 1 /* Encoded as integer */
#define OBJ_ENCODING_HT 2 /* Encoded as hash table */
#define OBJ_ENCODING_ZIPMAP 3 /* Encoded as zipmap */
#define OBJ_ENCODING_LINKEDLIST 4 /* No longer used: old list encoding. */
#define OBJ_ENCODING_ZIPLIST 5 /* Encoded as ziplist */
#define OBJ_ENCODING_INTSET 6 /* Encoded as intset */
#define OBJ_ENCODING_SKIPLIST 7 /* Encoded as skiplist */
#define OBJ_ENCODING_EMBSTR 8 /* Embedded sds string encoding */
#define OBJ_ENCODING_QUICKLIST 9 /* Encoded as linked list of ziplists */
#define OBJ_ENCODING_STREAM 10 /* Encoded as a radix tree of listpacks */

Redis存储结构探究

一、概述

二、hash算法

三、redis字典

3.1 redis的hash表

3.2 hash表的节点

3.3 redis字典

四、redis数据库的结构

五、rehash

六、渐进式rehash

七、redisObject

继续阅读

秒懂JVM的三大参数类型，就靠这十个小实验了初级—中级—高级三个级别的大厂面试真题阿里云——Java 实习生/初级美团——Java 中级蚂蚁金服——Java 高级基础篇JVM 篇MySQL 篇Redis 篇

CentOs7 安装redis4.0 遇到的坑一、CentOs7连接网络二、make文件的时候出错三、安装ruby的redis插件时候报错最后

docker 搭建Redis 问题解决

如何解决Redis缓存击穿、雪崩、穿透问题

java 版本的redis-stat不能运行在后台和daemon

redis管理常用命令

django短信验证码的后端实现

Redis订阅了一段时间后订阅失效了（ redisTemplate.convertAndSend）

2022秋招面试总结（cpp+java+测开）百度测开一面字节后端一面虾皮后端一面虾皮后端二面

数据迁移方法数据迁移原则数据迁移之双写方案数据迁移之级联同步方案

微服务-性能压测\缓存redis和分布式锁redisson和SpringCache

Nacos 2.0 升级前后性能对比压测

Spring数据和Redis

redis集群数据一致性_RedisRaft为Redis集群带来强大的数据一致性

supervisor 管理redis 和httpd 环境centos7

Redis简介一(单机版)发展历程Redis