Redis存儲結構探究

一、概述

主流的key-value存儲系統，都是在系統内部維護一個hash表，因為對hash表的操作時間複雜度為O(1)。如果資料增加以後，導緻沖突嚴重，時間複雜度增加，則可以對hash表進行rehash，以來保證操作的常量時間複雜度。

那麼，對于這樣一個基于hash表的key-value存儲系統，提出以下2個問題：

1）如何提供這麼豐富的資料結構呢？

2）資料結構在記憶體中如何存儲？

本文，我們将講解、示範redis的記憶體布局和資料結構。

二、hash算法

思考題：如果我們有一個k-v系統，并把該系統設計成連結清單，比如：

現在我們需要在這個連結清單中查找“biying”的url，那麼就需要周遊整個連結清單，然後與每個連結清單中的key進行比對，這樣做的效率比較低，是一個O（n）的複雜度。假設我們有10W個key，連結清單的存儲方式是不适用的。這個時候應該怎麼辦呢？

hash

hash即給定一個字元串或者其他的任意值x，通過hash函數得到一個散列值hash（x），比如給值“biying”，使用MurmurHash64B得到的hash值就是：7740020151228311510。

hash表可以簡單了解為一個數組：

Redis存儲結構探究

上述hash表設計引出以下問題：

1）如果通過索引（hash值）去讀取hash表，這樣設計的hash表會非常大，占用較大記憶體。

2）如果是通過hash值周遊hash表，如果k-v數量很多，則查找性能會是O(N)，則時間性能也很低。

另外，hash會有碰撞，即使再好的算法也會有碰撞，隻是碰撞率低而已。redis是如何解決這些問題的呢？

三、redis字典

3.1 redis的hash表

介紹hash字典前，先了解redis的hash表：

/* This is our hash table structure. Every dictionary has two of this as we
* implement incremental rehashing, for the old to the new table.*/
typedef struct dictht {
	dictEntry **table;
	unsigned long size;
	unsigned long sizemask;
	unsigned long used;
} dictht;

table 屬性是一個數組，數組中的每個元素都是一個指向 dict.h/dictEntry 結構的指針，每個dictEntry 結構儲存着一個鍵值對；
size 屬性記錄了哈希表的大小，也即是 table 數組的大小，而 used 屬性則記錄了哈希表目前已有節點（鍵值對）的數量；
sizemask 屬性的值總是等于 size - 1 ，這個屬性和哈希值一起決定一個鍵應該被放到 table 數組的哪個索引上面；
used屬性，表示hash表裡已有的數量。

空hash表示意圖：

Redis存儲結構探究

3.2 hash表的節點

hash表節點使用dicEntry結構表示，每個dicEntry結構都儲存一個鍵值對。

typedef struct dictEntry {
	// 鍵
	void *key;
	
	// 值
	union {
		void *val;
		uint64_t u64;
		int64_t s64;
		double d;
	} v;
	
	// 指向下個哈希表節點，形成連結清單
	struct dictEntry *next;
} dictEntry;

key：儲存鍵值對的鍵
v：儲存鍵值對的值，鍵值對的值可以是一個指針、整數或double類型。
next：指向另一個hash節點的指針，可以将多個哈希值相同的鍵值對連接配接在一起，用來解決沖突問題。

下圖展示如何通過next指針，将兩個索引相同的鍵k1和k0連接配接在一起。

Redis存儲結構探究

3.3 redis字典

定義：

typedef struct dict {
	// 字典類型
	dictType *type;
	
	// 字典的私有資料
	void *privdata;
	
	// 哈希表，二維的，預設使用ht[0],當需要進行rehash的時候，會用ht[1]進行
	dictht ht[2];
	
	// rehash的索引，當沒有進行rehash時其值為-1
	long rehashidx; /* rehashing not in progress if rehashidx ==-1 */
	
	// hash表的疊代器，一般用于rehash和主從複制等等
	 unsigned long iterators; /* number of iterators currently running */
} dict;

type 屬性和 privdata 屬性是針對不同類型的鍵值對，為建立多态字典設定的：

type 屬性是指向 dictType 結構的指針，每個 dictType 結構儲存了一簇用于操作特定類型鍵值對的函數， Redis 會為用途不同的字典設定不同的類型特定函數。
privdata 屬性則儲存了需要傳給那些類型特定函數的可選參數。

typedef struct dictType {
	uint64_t (*hashFunction)(const void *key);
	void *(*keyDup)(void *privdata, const void *key);
	void *(*valDup)(void *privdata, const void *obj);
	int (*keyCompare)(void *privdata, const void *key1, const void*key2);
	void (*keyDestructor)(void *privdata, void *key);
	void (*valDestructor)(void *privdata, void *obj);
} dictType;

ht 屬性是一個包含兩個項的數組，數組中的每個項都是一個 dictht 哈希表，一般情況下，字典隻使用ht[0] 哈希表， ht[1] 哈希表隻會在對 ht[0] 哈希表進行rehash 時使用。

除了 ht[1] 之外，另一個和 rehash 有關的屬性就是 rehashidx ：它記錄了 rehash 目前的進度，如果目前沒有在進行rehash ，那麼它的值為 -1 。如下圖展示了一個普通的字典：

Redis存儲結構探究

redis是如何解決時間效率和空間效率的？

初始時，字典的hash表的大小隻有4（sizemask為3,），則通過hash函數計算出的hash值可能會很大，此時hash值會與上（&）sizemask，得到存儲在hash表裡的table[index]，序号的，見如下源碼：

# 使用字典設定的哈希函數，計算鍵 key 的哈希值
hash = dict->type->hashFunction(key);

# 使⽤哈希表的 sizemask 屬性和哈希值，計算出索引值
# 根據情況不同， ht[x] 可以是 ht[0] 或者 ht[1]
index = hash & dict->ht[x].sizemask;

以前文提到的4個URL為例，它們是如何存儲到一個空的hash表呢？

Redis存儲結構探究

插入到字典後，字典結構如下：

Redis存儲結構探究

總結：

hash表是随着K-V數量的增大而逐漸增大的，并不直接以key的hash值為下标去取值得，而是以hash & sizemask去擷取hash表的對應節點的；hash表的節點實際上是⼀個連結清單，如果hash &sizemask有沖突，則也把沖突key放在hash表的連結清單上，取值得時候還得周遊hash表裡的連結清單。

四、redis資料庫的結構

在redis的内部，redisServer結構體的全局變量server，server儲存了redis服務端所有的資訊，包括目前程序的PID、伺服器的端口号、資料庫個數、統計資訊等等。當然，它也包含了資料庫資訊，包括資料庫的個數、以及一個redisDb數組。

struct redisServer {
	 ...
	redisDb *db;
	int dbnum; /* Total number of configured DBs */
	...

dbnum就是redisDb數組的長度，每個資料庫，都對應于一個redisDb，在redis的用戶端中，可以通過select N來選擇使用哪個資料庫，各個資料庫之間互相獨立。例如：可以在不同的資料庫中同時存在名為”0voice”的key。

從上面的分析中可以看到，server是全局變量，它包含了若幹個redisDb，每個redisDb是一keyspace，各個keyspace互相獨立，互不幹擾。

redisDb的定義：

/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
	dict *dict; /* The keyspace for this DB */
	dict *expires; /* Timeout of keys with a timeoutset */
	dict *blocking_keys; /* Keys with clients waiting fordata (BLPOP)*/
	dict *ready_keys; /* Blocked keys that received a PUSH */
	dict *watched_keys; /* WATCHED keys for MULTI/EXEC CAS */
	int id; /* Database ID */
	long long avg_ttl; /* Average TTL, just for stats */
	list *defrag_later; /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

redis的每個資料庫是獨立的keyspace，是以，我們理所當然的認為，redis的資料庫是一個hash表。

但是，從redisDb的定義來看，它并不是hash表，而是一個包含了很多hash表的結構。

之是以這樣做，是因為redis還需要提供除了set、get以外更加豐富的功能(例如：鍵的逾時機制)。

五、rehash

随着操作的不斷執行，哈希表儲存的鍵值對會逐漸地增多或者減少，為了讓哈希表的負載因子（ratio）維持在合理的範圍之内，當哈希表儲存的鍵值對數量太多或者太少時，程式需要對哈希表的大小進行相應的擴充或者收縮。

ratio = ht[0].used / ht[0].size

比如，hash表的size為4，如果已經插入了4個k-v的話，則ratio 為 1

擴充和收縮哈希表的工作可以通過執行rehash （重新散列）操作來完成， Redis 對字典的哈希表執行rehash 的政策如下：

1）如果ratio小于0.1，則會對hash表進行收縮操作

/* If the percentage of used slots in the HT reaches HASHTABLE_MIN_FILL
 * we resize the hash table to save memory */

void tryResizeHashTables(int dbid) {
	if (htNeedsResize(server.db[dbid].dict))
		dictResize(server.db[dbid].dict);
	if (htNeedsResize(server.db[dbid].expires))
		dictResize(server.db[dbid].expires);
}

int htNeedsResize(dict *dict) {
	long long size, used;
	size = dictSlots(dict);
	 used = dictSize(dict);
	return (size > DICT_HT_INITIAL_SIZE && (used*100/size < HASHTABLE_MIN_FILL));
}

2）伺服器目前沒有在執行 BGSAVE 指令或者 BGREWRITEAOF 指令，并且哈希表的負載因子大于等于 1 ，則擴容hash表，擴大小為目前ht[0].used*2。

3）伺服器⽬前正在執⾏行BGSAVE 指令或者 BGREWRITEAOF 指令，并且哈希表的負載因子大于等于 5，則擴容hash表，并且擴容大小為目前ht[0].used*2。

/* Expand the hash table if needed */
static int _dictExpandIfNeeded(dict *d)
{
	 /* Incremental rehashing already in progress. Return. */
	if (dictIsRehashing(d)) return DICT_OK;
	
	/* If the hash table is empty expand it to the initial size.
	*/
	 if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);
	 
	/* If we reached the 1:1 ratio, and we are allowed to resizethe hash
	* table (global setting) or we should avoid it but the ratiobetween
	* elements/buckets is over the "safe" threshold, we resize doubling
	* the number of buckets. */
	if (d->ht[0].used >= d->ht[0].size &&
	(dict_can_resize ||d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
	{
		return dictExpand(d, d->ht[0].used*2);
	}
	return DICT_OK;
}

其實上文說的擴容為ht[0].uesd*2 是不嚴謹的，實際上是剛好大于該書的2的N次方：

unsigned long realsize = _dictNextPower(size);

/* Our hash table capability is a power of two */
static unsigned long _dictNextPower(unsigned long size)
{
	unsigned long i = DICT_HT_INITIAL_SIZE;
	if (size >= LONG_MAX) return LONG_MAX + 1LU;
	
	while(1) {
		if (i >= size)
			return i;
		i *= 2;
	}
}

擴容的步驟如下：

為字典ht[1]哈希表配置設定合适的空間；
将ht[0]中所有的鍵值對rehash到ht[1]：rehash 指的是重新計算鍵的哈希值和索引值，然後将鍵值對放置到 ht[1] 哈希表的指定位置上；
當 ht[0] 包含的所有鍵值對都遷移到了 ht[1] 之後（ht[0] 變為空表），釋放 ht[0] ，将 ht[1] 設定為 ht[0] ，并在 ht[1] 新建立一個空白哈希表，為下一次 rehash 做準備。。

六、漸進式rehash

上節說過，擴充或收縮哈希表需要将 ht[0] 的所有鍵值對 rehash 到 ht[1] ，但是，這個rehash 動作并不一次性、集中式地完成的，而是分多次、漸進式地完成的。

這樣做的原因在于，如果 ht[0] 隻儲存着四個鍵值對，那麼伺服器可以在瞬間就将這些鍵值對全部rehash 到 ht[1] ；但是，如果哈希表儲存的鍵值對數量不是四個，而是四百萬、四千萬甚至四億個鍵值對，那麼要一次性将這些鍵值對全部 rehash 到 ht[1] 的話，龐大的計算量可能會導緻伺服器在一段時間内停止服務。

是以，為了避免 rehash 對伺服器性能造成影響，伺服器不是一次性将 ht[0] 的所有鍵值對全部rehash 到 ht[1] ，而是分多次、漸進式地将 ht[0] 的鍵值對慢慢地 rehash 到 ht[1] 。

以下是哈希表漸進式 rehash 的詳細步驟：

為 ht[1] 配置設定空間，讓字典同時持有 ht[0] 和 ht[1] 兩個哈希表。
在字典中維持一個索引計數器變量 rehashidx ，并将它的值設定為 0 ，表示 rehash 工作正式開始。
在 rehash 進行期間，每次對字典執行添加、删除、查找或者更新操作時，程式除了執⾏指定的操作以外，還會順帶将 ht[0] 哈希表在 rehashidx 索引上的所有鍵值對 rehash 到 ht[1] ，當 rehash 工作完成之後，程式将 rehashidx 屬性的值增1。
随着字典操作的不斷執行，最終在某個時間點上， ht[0] 的所有鍵值對都會被 rehash 至 ht[1] ，這時程式将 rehashidx 屬性的值設為 -1 ，表示 rehash 操作已完成。

漸進式 rehash 的好處在于它采取分而治之的方式，将 rehash 鍵值對所需的計算工作均灘到對字典的每個添加、删除、查找和更新操作上，甚至是背景啟一個定時器，每次時間循環時隻工作作1毫秒，進而避免了集中式 rehash 而帶來的龐大計算量。

/* This function handles 'background' operations we are requiredto do
* incrementally in Redis databases, such as active key expiring,resizing,
* rehashing. */
void databasesCron(void) {
	 ...
	
	/* Perform hash tables rehashing if needed, but only if thereare no
	* other processes saving the DB on disk. Otherwise rehashingis bad
	 * as will cause a lot of copy-on-write of memory pages. */
	 if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
	{
		 /* Resize */
		for (j = 0; j < dbs_per_call; j++) {
			tryResizeHashTables(resize_db % server.dbnum);
			resize_db++;
		}
		/* Rehash */
		if (server.activerehashing) {
			 for (j = 0; j < dbs_per_call; j++) {
				int work_done = incrementallyRehash(rehash_db);
				if (work_done) {
					 /* If the function did some work, stop here,we'll do
					* more at the next cron loop. */
					break;
				 } else {
					/* If this db didn't need rehash, we'll try the next one. */
					rehash_db++;
					rehash_db %= server.dbnum;
				 }
			}
	 	}
	}
}

/* Our hash table implementation performs rehashing incrementallywhile
* we write/read from the hash table. Still if the server is idle, the hash
 * table will use two tables for a long time. So we try to use 1millisecond
* of CPU time at every call of this function to perform some rehahsing.
*
* The function returns 1 if some rehashing was performed, otherwise 0
* is returned. */
 int incrementallyRehash(int dbid) {
	/* Keys dictionary */
	if (dictIsRehashing(server.db[dbid].dict)) {
	 	dictRehashMilliseconds(server.db[dbid].dict,1);
		return 1; /* already used our millisecond for this loop... */
	 }
	 /* Expires */
	if (dictIsRehashing(server.db[dbid].expires)) {
		dictRehashMilliseconds(server.db[dbid].expires,1);
	 	return 1; /* already used our millisecond for this loop... */
	}
	 return 0;
}

七、redisObject

hash表裡的一個個對象都是什麼呢？

typedef struct redisObject {
	unsigned type:4;
	unsigned encoding:4;
	unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
	* LFU data (least significant 8 bits frequency
	 * and most significant 16 bits accesstime). */
	 int refcount;
	 void *ptr;
} robj;

redis的對象有11種，之多，他們全都繼承于redisObject。

/* Objects encoding. Some kind of objects like Strings and Hashes can be
* internally represented in multiple ways. The 'encoding' fieldof the object
* is set to one of this fields for this object. */
#define OBJ_ENCODING_RAW 0 /* Raw representation */
#define OBJ_ENCODING_INT 1 /* Encoded as integer */
#define OBJ_ENCODING_HT 2 /* Encoded as hash table */
#define OBJ_ENCODING_ZIPMAP 3 /* Encoded as zipmap */
#define OBJ_ENCODING_LINKEDLIST 4 /* No longer used: old list encoding. */
#define OBJ_ENCODING_ZIPLIST 5 /* Encoded as ziplist */
#define OBJ_ENCODING_INTSET 6 /* Encoded as intset */
#define OBJ_ENCODING_SKIPLIST 7 /* Encoded as skiplist */
#define OBJ_ENCODING_EMBSTR 8 /* Embedded sds string encoding */
#define OBJ_ENCODING_QUICKLIST 9 /* Encoded as linked list of ziplists */
#define OBJ_ENCODING_STREAM 10 /* Encoded as a radix tree of listpacks */

Redis存儲結構探究

一、概述

二、hash算法

三、redis字典

3.1 redis的hash表

3.2 hash表的節點

3.3 redis字典

四、redis資料庫的結構

五、rehash

六、漸進式rehash

七、redisObject

繼續閱讀

秒懂JVM的三大參數類型，就靠這十個小實驗了初級—中級—進階三個級别的大廠面試真題阿裡雲——Java 實習生/初級美團——Java 中級螞蟻金服——Java 進階基礎篇JVM 篇MySQL 篇Redis 篇

CentOs7 安裝redis4.0 遇到的坑一、CentOs7連接配接網絡二、make檔案的時候出錯三、安裝ruby的redis插件時候報錯最後

docker 搭建Redis 問題解決

如何解決Redis緩存擊穿、雪崩、穿透問題

java 版本的redis-stat不能運作在背景和daemon

redis管理常用指令

django短信驗證碼的後端實作

Redis訂閱了一段時間後訂閱失效了（ redisTemplate.convertAndSend）

2022秋招面試總結（cpp+java+測開）百度測開一面位元組後端一面蝦皮後端一面蝦皮後端二面

資料遷移方法資料遷移原則資料遷移之雙寫方案資料遷移之級聯同步方案

微服務-性能壓測\緩存redis和分布式鎖redisson和SpringCache

Nacos 2.0 更新前後性能對比壓測

Spring資料和Redis

redis叢集資料一緻性_RedisRaft為Redis叢集帶來強大的資料一緻性

supervisor 管理redis 和httpd 環境centos7

Redis簡介一(單機版)發展曆程Redis