本節講述記憶體中LevelDB的資料結構Memtable,Memtable在整個體系中的重要地位也不言而喻。總體而言,所有KV資料都是存儲在Memtable,Immutable Memtable和SSTable中的,Immutable Memtable從結構上講和Memtable是完全一樣的,差別僅僅在于其是隻讀的,不允許寫入操作,而Memtable則是允許寫入和讀取的。當Memtable寫入的資料占用記憶體到達指定數量(Options.write_buffer_size),則自動轉換為Immutable Memtable,等待Dump到磁盤中,系統會自動生成新的Memtable供寫操作寫入新資料,了解了Memtable,那麼Immutable Memtable自然不在話下。
LevelDb的MemTable提供了将KV資料寫入,删除以及讀取KV記錄的操作接口,但是事實上Memtable并不存在真正的删除操作,删除某個Key的Value在Memtable内是作為插入一條記錄實施的,但是會打上一個Key的删除标記,真正的删除操作是Lazy的,會在以後的Compaction過程中去掉這個KV。
需要注意的是,LevelDb的Memtable中KV對是根據Key大小有序存儲的,在系統插入新的KV時,LevelDb要把這個KV插到合适的位置上以保持這種Key有序性。其實,LevelDb的Memtable類隻是一個接口類,真正的操作是通過背後的SkipList來做的,包括插入操作和讀取操作等,是以Memtable的核心資料結構是一個SkipList。
SkipList是由William Pugh發明。他在Communications of the ACM June 1990, 33(6) 668-676 發表了Skip lists: a probabilistic alternative to balanced trees,在該論文中詳細解釋了SkipList的資料結構和插入删除操作。
SkipList是平衡樹的一種替代資料結構,但是和紅黑樹不相同的是,SkipList對于樹的平衡的實作是基于一種随機化的算法的,這樣也就是說SkipList的插入和删除的工作是比較簡單的。
關于SkipList的詳細介紹可以參考《levelDB源碼分析-Skiplist》或 http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html,講述的很清楚,LevelDb的SkipList基本上是一個具體實作,并無特殊之處。
SkipList不僅是維護有序資料的一個簡單實作,而且相比較平衡樹來說,在插入資料的時候可以避免頻繁的樹節點調整操作,是以寫入效率是很高的,LevelDb整體而言是個高寫入系統,SkipList在其中應該也起到了很重要的作用。Redis為了加快插入操作,也使用了SkipList來作為内部實作資料結構。
下面介紹MemTable的定義及介紹:
MemTable是基于引用計數的,每次使用需要首先調用Ref(),使用完畢時調用Unref();
class MemTable {
public:
// MemTables are reference counted. The initial reference count
// is zero and the caller must call Ref() at least once.
explicit MemTable(const InternalKeyComparator& comparator); // 構造函數,傳入比較器comparator(見下面介紹)
// Increase reference count.
void Ref() { ++refs_; } // 增加引用計數
// Drop reference count. Delete if no more references exist.
void Unref() { // 減少引用計數
--refs_;
assert(refs_ >= 0);
if (refs_ <= 0) {
delete this;
}
}
// Returns an estimate of the number of bytes of data in use by this
// data structure.
//
// REQUIRES: external synchronization to prevent simultaneous
// operations on the same MemTable.
size_t ApproximateMemoryUsage(); // 傳回使用容量
// Return an iterator that yields the contents of the memtable.
//
// The caller must ensure that the underlying MemTable remains live
// while the returned iterator is live. The keys returned by this
// iterator are internal keys encoded by AppendInternalKey in the
// db/format.{h,cc} module.
Iterator* NewIterator(); // 疊代器
// Add an entry into memtable that maps key to value at the
// specified sequence number and with the specified type.
// Typically value will be empty if type==kTypeDeletion.
void Add(SequenceNumber seq, ValueType type, // 增加一個key/value到MemTable中
const Slice& key, // type=kTypeDeletion時,value為空
const Slice& value);
// If memtable contains a value for key, store it in *value and return true.
// If memtable contains a deletion for key, store a NotFound() error
// in *status and return true.
// Else, return false.
bool Get(const LookupKey& key, std::string* value, Status* s); // 查詢接口
private:
~MemTable(); // Private since only Unref() should be used to delete it // 采用引用計數,定義為private,外部不能删除
struct KeyComparator {
const InternalKeyComparator comparator;
explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) { }
int operator()(const char* a, const char* b) const;
};
friend class MemTableIterator;
friend class MemTableBackwardIterator;
typedef SkipList<const char*, KeyComparator> Table;
Table table_; // Skiplist
KeyComparator comparator_; // 比較器comparator_
Arena arena_; // 記憶體池
int refs_; // 引用計數
// No copying allowed
MemTable(const MemTable&); // 拷貝構造函數及指派操作不允許
void operator=(const MemTable&);
};
// Modules in this directory should keep internal keys wrapped inside
// the following class instead of plain strings so that we do not
// incorrectly use string comparisons instead of an InternalKeyComparator.
class InternalKey { // 内部Key
private:
std::string rep_;
public:
InternalKey() { } // Leave rep_ as empty to indicate it is invalid
InternalKey(const Slice& user_key, SequenceNumber s, ValueType t) {
AppendInternalKey(&rep_, ParsedInternalKey(user_key, s, t));
}
void DecodeFrom(const Slice& s) { rep_.assign(s.data(), s.size()); } // 轉化為string
Slice Encode() const { // 傳回string
assert(!rep_.empty());
return rep_;
}
Slice user_key() const { return ExtractUserKey(rep_); } // 傳回User Key
void SetFrom(const ParsedInternalKey& p) {
rep_.clear();
AppendInternalKey(&rep_, p);
}
void Clear() { rep_.clear(); }
std::string DebugString() const;
};
struct ParsedInternalKey { // InternalKey格式:
Slice user_key; //+ user key
SequenceNumber sequence; //+ sequence
ValueType type; //+ type: kTypeDeletion or kTypeValue
ParsedInternalKey() { } // Intentionally left uninitialized (for speed)
ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t) // 構造函數
: user_key(u), sequence(seq), type(t) { }
std::string DebugString() const;
};
// Returns the user key portion of an internal key.
inline Slice ExtractUserKey(const Slice& internal_key) { // 傳回User Key
assert(internal_key.size() >= 8);
return Slice(internal_key.data(), internal_key.size() - 8);
}
inline ValueType ExtractValueType(const Slice& internal_key) { // 傳回Type
assert(internal_key.size() >= 8);
const size_t n = internal_key.size();
uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
unsigned char c = num & 0xff;
return static_cast<ValueType>(c);
}
void AppendInternalKey(std::string* result, const ParsedInternalKey& key) { // 将InternalKey編譯為string
result->append(key.user_key.data(), key.user_key.size());
PutFixed64(result, PackSequenceAndType(key.sequence, key.type));
}
static uint64_t PackSequenceAndType(uint64_t seq, ValueType t) { // sequence | type
assert(seq <= kMaxSequenceNumber);
assert(t <= kValueTypeForSeek);
return (seq << 8) | t;
}
// A comparator for internal keys that uses a specified comparator for
// the user key portion and breaks ties by decreasing sequence number.
class InternalKeyComparator : public Comparator {
private:
const Comparator* user_comparator_;
public:
explicit InternalKeyComparator(const Comparator* c) : user_comparator_(c) { }
virtual const char* Name() const;
virtual int Compare(const Slice& a, const Slice& b) const;
virtual void FindShortestSeparator(std::string* start,
const Slice& limit) const;
virtual void FindShortSuccessor(std::string* key) const;
const Comparator* user_comparator() const { return user_comparator_; }
int Compare(const InternalKey& a, const InternalKey& b) const;
};
inline int InternalKeyComparator::Compare(const InternalKey& a, const InternalKey& b) const {
return Compare(a.Encode(), b.Encode()); // Encode傳回string,預設采用string的字典序比較
}
int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
// Order by:
// increasing user key (according to user-supplied comparator)
// decreasing sequence number
// decreasing type (though sequence# should be enough to disambiguate)
int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
if (r == 0) {
const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
if (anum > bnum) {
r = -1;
} else if (anum < bnum) {
r = +1;
}
}
return r;
}
inline bool ParseInternalKey(const Slice& internal_key, // 解析InternalKey
ParsedInternalKey* result) {
const size_t n = internal_key.size();
if (n < 8) return false;
uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
unsigned char c = num & 0xff; // type:1 byte
result->sequence = num >> 8; // 最後的8位元組,最後1位元組為type,其餘為sequence
result->type = static_cast<ValueType>(c);
result->user_key = Slice(internal_key.data(), n - 8); // User Key
return (c <= static_cast<unsigned char>(kTypeValue)); // 傳回值為:type是否合法
}
// A helper class useful for DBImpl::Get()
class LookupKey { // 查找LookupKey
public:
// Initialize *this for looking up user_key at a snapshot with
// the specified sequence number.
LookupKey(const Slice& user_key, SequenceNumber sequence);
~LookupKey();
// Return a key suitable for lookup in a MemTable.
Slice memtable_key() const { return Slice(start_, end_ - start_); }
// Return an internal key (suitable for passing to an internal iterator)
Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }
// Return the user key
Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }
private:
// We construct a char array of the form:
// klength varint32 <-- start_
// userkey char[klength] <-- kstart_
// tag uint64
// <-- end_
// The array is a suitable MemTable key.
// The suffix starting with "userkey" can be used as an InternalKey.
const char* start_;
const char* kstart_;
const char* end_;
char space_[200]; // Avoid allocation for short keys
// No copying allowed
LookupKey(const LookupKey&);
void operator=(const LookupKey&);
};
inline LookupKey::~LookupKey() {
if (start_ != space_) delete[] start_;
}
使用:
class MemTableInserter : public WriteBatch::Handler { // MemTableInserter直接操作Memtable
public:
SequenceNumber sequence_;
MemTable* mem_;
virtual void Put(const Slice& key, const Slice& value) { // 插入記錄
mem_->Add(sequence_, kTypeValue, key, value);
sequence_++;
}
virtual void Delete(const Slice& key) { // 删除記錄
mem_->Add(sequence_, kTypeDeletion, key, Slice());
sequence_++;
}
};
Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) { //
MemTableInserter inserter;
inserter.sequence_ = WriteBatchInternal::Sequence(b);
inserter.mem_ = memtable;
return b->Iterate(&inserter);
}
總結:
1、SkipList中儲存的資料格式對應一個LookupKey+Value,其格式為:
(string)"internel-key-size + internal-key-data + value-size + value-data",# LookupKey序列化 + Value
即:(string)LookupKey + value-size+value-data
2、InternalKey資料格式為:(string)"user-key + sequence + type", 對應ParsedInternalKey,其中sequnce+type占用8bytes,type占用1位元組
ParsedInternalKey:
user key
sequence number
type
InternalKey:
string(ParsedInternalKey) # ParseInternalKey的序列化字元串
LookupKey:
internal-key-size + InternalKey # InternalKey長度 + InternalKey
3、疑問:Memtable的Add方法調用table_.Insert(buf)時,buf為InternalKey+Value,那麼Skiplist中key的排序包括了value部分嗎?
因為Skiplist聲明為模闆類,并且傳入了Comparator,這裡傳入的是InternalKeyComparator對象,Skiplist中比較時,執行的是InternalKeyComparator::Compare函數,裡面調用ExtractUserKey抽取了user key,是以最終比較的還是User Key,value部分并沒有參與比較。
Skiplist中的Node中的資料為:InternalKey+Value-size+Value-data。