天天看點

levelDB源碼分析-Memtable

        本節講述記憶體中LevelDB的資料結構Memtable,Memtable在整個體系中的重要地位也不言而喻。總體而言,所有KV資料都是存儲在Memtable,Immutable Memtable和SSTable中的,Immutable Memtable從結構上講和Memtable是完全一樣的,差別僅僅在于其是隻讀的,不允許寫入操作,而Memtable則是允許寫入和讀取的。當Memtable寫入的資料占用記憶體到達指定數量(Options.write_buffer_size),則自動轉換為Immutable Memtable,等待Dump到磁盤中,系統會自動生成新的Memtable供寫操作寫入新資料,了解了Memtable,那麼Immutable Memtable自然不在話下。

   LevelDb的MemTable提供了将KV資料寫入,删除以及讀取KV記錄的操作接口,但是事實上Memtable并不存在真正的删除操作,删除某個Key的Value在Memtable内是作為插入一條記錄實施的,但是會打上一個Key的删除标記,真正的删除操作是Lazy的,會在以後的Compaction過程中去掉這個KV。

   需要注意的是,LevelDb的Memtable中KV對是根據Key大小有序存儲的,在系統插入新的KV時,LevelDb要把這個KV插到合适的位置上以保持這種Key有序性。其實,LevelDb的Memtable類隻是一個接口類,真正的操作是通過背後的SkipList來做的,包括插入操作和讀取操作等,是以Memtable的核心資料結構是一個SkipList。

   SkipList是由William Pugh發明。他在Communications of the ACM June 1990, 33(6) 668-676 發表了Skip lists: a probabilistic alternative to balanced trees,在該論文中詳細解釋了SkipList的資料結構和插入删除操作。

    SkipList是平衡樹的一種替代資料結構,但是和紅黑樹不相同的是,SkipList對于樹的平衡的實作是基于一種随機化的算法的,這樣也就是說SkipList的插入和删除的工作是比較簡單的。

    關于SkipList的詳細介紹可以參考《levelDB源碼分析-Skiplist》或 http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html,講述的很清楚,LevelDb的SkipList基本上是一個具體實作,并無特殊之處。

  SkipList不僅是維護有序資料的一個簡單實作,而且相比較平衡樹來說,在插入資料的時候可以避免頻繁的樹節點調整操作,是以寫入效率是很高的,LevelDb整體而言是個高寫入系統,SkipList在其中應該也起到了很重要的作用。Redis為了加快插入操作,也使用了SkipList來作為内部實作資料結構。

  下面介紹MemTable的定義及介紹:

  MemTable是基于引用計數的,每次使用需要首先調用Ref(),使用完畢時調用Unref();

class MemTable {
    public:
        // MemTables are reference counted.  The initial reference count
        // is zero and the caller must call Ref() at least once.
        explicit MemTable(const InternalKeyComparator& comparator);					// 構造函數,傳入比較器comparator(見下面介紹)

        // Increase reference count.
        void Ref() { ++refs_; }												// 增加引用計數

        // Drop reference count.  Delete if no more references exist.
        void Unref() {														// 減少引用計數
            --refs_;
            assert(refs_ >= 0);
            if (refs_ <= 0) {
                delete this;
            }
        }

        // Returns an estimate of the number of bytes of data in use by this
        // data structure.
        //
        // REQUIRES: external synchronization to prevent simultaneous
        // operations on the same MemTable.
        size_t ApproximateMemoryUsage();										// 傳回使用容量

        // Return an iterator that yields the contents of the memtable.
        //
        // The caller must ensure that the underlying MemTable remains live
        // while the returned iterator is live.  The keys returned by this
        // iterator are internal keys encoded by AppendInternalKey in the
        // db/format.{h,cc} module.
        Iterator* NewIterator();												// 疊代器

        // Add an entry into memtable that maps key to value at the
        // specified sequence number and with the specified type.
        // Typically value will be empty if type==kTypeDeletion.
        void Add(SequenceNumber seq, ValueType type,								// 增加一個key/value到MemTable中
        		   const Slice& key,											// type=kTypeDeletion時,value為空
        		   const Slice& value);

        // If memtable contains a value for key, store it in *value and return true.
        // If memtable contains a deletion for key, store a NotFound() error
        // in *status and return true.
        // Else, return false.
        bool Get(const LookupKey& key, std::string* value, Status* s);				// 查詢接口

    private:
        ~MemTable();  // Private since only Unref() should be used to delete it		// 采用引用計數,定義為private,外部不能删除

        struct KeyComparator {
            const InternalKeyComparator comparator;
            explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) { }
            int operator()(const char* a, const char* b) const;
            };
        friend class MemTableIterator;
        friend class MemTableBackwardIterator;

        typedef SkipList<const char*, KeyComparator> Table;	
        Table table_;														// Skiplist				

        KeyComparator comparator_;											// 比較器comparator_

        Arena arena_;														// 記憶體池
        
        int refs_;															// 引用計數

        // No copying allowed
        MemTable(const MemTable&);											// 拷貝構造函數及指派操作不允許
        void operator=(const MemTable&);
        };
 
           
    // Modules in this directory should keep internal keys wrapped inside
    // the following class instead of plain strings so that we do not
    // incorrectly use string comparisons instead of an InternalKeyComparator.
    class InternalKey {														// 内部Key
      private:
        std::string rep_;
      public:
        InternalKey() { }   // Leave rep_ as empty to indicate it is invalid
        InternalKey(const Slice& user_key, SequenceNumber s, ValueType t) {
            AppendInternalKey(&rep_, ParsedInternalKey(user_key, s, t));
        }

        void DecodeFrom(const Slice& s) { rep_.assign(s.data(), s.size()); }			// 轉化為string
        Slice Encode() const {												// 傳回string
            assert(!rep_.empty());
            return rep_;
        }

        Slice user_key() const { return ExtractUserKey(rep_); }					// 傳回User Key

        void SetFrom(const ParsedInternalKey& p) {
            rep_.clear();
            AppendInternalKey(&rep_, p);
        }

        void Clear() { rep_.clear(); }

        std::string DebugString() const;
    };
    
    struct ParsedInternalKey {												//  InternalKey格式:
        Slice user_key;             											//+   user key
        SequenceNumber sequence;    											//+   sequence
        ValueType type;             											//+   type: kTypeDeletion or kTypeValue

        ParsedInternalKey() { }  // Intentionally left uninitialized (for speed)
        ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)		// 構造函數
            : user_key(u), sequence(seq), type(t) { }
        std::string DebugString() const;
    }; 
    
    // Returns the user key portion of an internal key.
    inline Slice ExtractUserKey(const Slice& internal_key) {						// 傳回User Key
        assert(internal_key.size() >= 8);
        return Slice(internal_key.data(), internal_key.size() - 8);
    }

    inline ValueType ExtractValueType(const Slice& internal_key) {					// 傳回Type
        assert(internal_key.size() >= 8);
        const size_t n = internal_key.size();
        uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
        unsigned char c = num & 0xff;
        return static_cast<ValueType>(c);
    }
    void AppendInternalKey(std::string* result, const ParsedInternalKey& key) {		// 将InternalKey編譯為string
        result->append(key.user_key.data(), key.user_key.size());
        PutFixed64(result, PackSequenceAndType(key.sequence, key.type));
    }
    static uint64_t PackSequenceAndType(uint64_t seq, ValueType t) {				// sequence | type
        assert(seq <= kMaxSequenceNumber);
        assert(t <= kValueTypeForSeek);
        return (seq << 8) | t;
    }
    
    // A comparator for internal keys that uses a specified comparator for
    // the user key portion and breaks ties by decreasing sequence number.
    class InternalKeyComparator : public Comparator {
        private:
        const Comparator* user_comparator_;
        public:
        explicit InternalKeyComparator(const Comparator* c) : user_comparator_(c) { }
        virtual const char* Name() const;
        virtual int Compare(const Slice& a, const Slice& b) const;
        virtual void FindShortestSeparator(std::string* start,
        								const Slice& limit) const;
        virtual void FindShortSuccessor(std::string* key) const;

        const Comparator* user_comparator() const { return user_comparator_; }

        int Compare(const InternalKey& a, const InternalKey& b) const;
   };


    inline int InternalKeyComparator::Compare(const InternalKey& a, const InternalKey& b) const {
        return Compare(a.Encode(), b.Encode());	// Encode傳回string,預設采用string的字典序比較
    }
    int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
        // Order by:
        //    increasing user key (according to user-supplied comparator)
        //    decreasing sequence number
        //    decreasing type (though sequence# should be enough to disambiguate)
        int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
        if (r == 0) {
            const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
            const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
            if (anum > bnum) {
                r = -1;
                } else if (anum < bnum) {
                r = +1;
            }
        }
        return r;
    }


    inline bool ParseInternalKey(const Slice& internal_key,						// 解析InternalKey
    						    ParsedInternalKey* result) {				
        const size_t n = internal_key.size();
        if (n < 8) return false;
        uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
        unsigned char c = num & 0xff;											// type:1 byte
        result->sequence = num >> 8;											// 最後的8位元組,最後1位元組為type,其餘為sequence
        result->type = static_cast<ValueType>(c);
        result->user_key = Slice(internal_key.data(), n - 8);						// User Key
        return (c <= static_cast<unsigned char>(kTypeValue));						// 傳回值為:type是否合法
    }
   
 // A helper class useful for DBImpl::Get()
    class LookupKey {														// 查找LookupKey
        public:
        // Initialize *this for looking up user_key at a snapshot with
        // the specified sequence number.
        LookupKey(const Slice& user_key, SequenceNumber sequence);

        ~LookupKey();

        // Return a key suitable for lookup in a MemTable.
        Slice memtable_key() const { return Slice(start_, end_ - start_); }

        // Return an internal key (suitable for passing to an internal iterator)
        Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }

        // Return the user key
        Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }

        private:
        // We construct a char array of the form:
        //    klength  varint32               <-- start_
        //    userkey  char[klength]          <-- kstart_
        //    tag      uint64
        //                                    <-- end_
        // The array is a suitable MemTable key.
        // The suffix starting with "userkey" can be used as an InternalKey.
        const char* start_;
        const char* kstart_;
        const char* end_;
        char space_[200];      // Avoid allocation for short keys

        // No copying allowed
        LookupKey(const LookupKey&);
        void operator=(const LookupKey&);
        };

    inline LookupKey::~LookupKey() {
        if (start_ != space_) delete[] start_;
    }
    
    
  使用:
   	class MemTableInserter : public WriteBatch::Handler {				// MemTableInserter直接操作Memtable
            public:
            SequenceNumber sequence_;
            MemTable* mem_;

            virtual void Put(const Slice& key, const Slice& value) {		// 插入記錄
                mem_->Add(sequence_, kTypeValue, key, value);
                sequence_++;
            }
            virtual void Delete(const Slice& key) {						// 删除記錄
                mem_->Add(sequence_, kTypeDeletion, key, Slice());
                sequence_++;
            }
    };  
    Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) {	// 
        MemTableInserter inserter;
        inserter.sequence_ = WriteBatchInternal::Sequence(b);
        inserter.mem_ = memtable;
        return b->Iterate(&inserter);
    }
    
    
           

   總結:

        1、SkipList中儲存的資料格式對應一個LookupKey+Value,其格式為:

            (string)"internel-key-size + internal-key-data + value-size + value-data",# LookupKey序列化 + Value

            即:(string)LookupKey + value-size+value-data            

        2、InternalKey資料格式為:(string)"user-key + sequence + type", 對應ParsedInternalKey,其中sequnce+type占用8bytes,type占用1位元組

        ParsedInternalKey:

            user key

            sequence number

            type

        InternalKey:

            string(ParsedInternalKey)   # ParseInternalKey的序列化字元串

        LookupKey:

            internal-key-size + InternalKey         # InternalKey長度 + InternalKey

        3、疑問:Memtable的Add方法調用table_.Insert(buf)時,buf為InternalKey+Value,那麼Skiplist中key的排序包括了value部分嗎?

        因為Skiplist聲明為模闆類,并且傳入了Comparator,這裡傳入的是InternalKeyComparator對象,Skiplist中比較時,執行的是InternalKeyComparator::Compare函數,裡面調用ExtractUserKey抽取了user key,是以最終比較的還是User Key,value部分并沒有參與比較。

        Skiplist中的Node中的資料為:InternalKey+Value-size+Value-data。

繼續閱讀