天天看点

levelDB源码分析-Memtable

        本节讲述内存中LevelDB的数据结构Memtable,Memtable在整个体系中的重要地位也不言而喻。总体而言,所有KV数据都是存储在Memtable,Immutable Memtable和SSTable中的,Immutable Memtable从结构上讲和Memtable是完全一样的,区别仅仅在于其是只读的,不允许写入操作,而Memtable则是允许写入和读取的。当Memtable写入的数据占用内存到达指定数量(Options.write_buffer_size),则自动转换为Immutable Memtable,等待Dump到磁盘中,系统会自动生成新的Memtable供写操作写入新数据,理解了Memtable,那么Immutable Memtable自然不在话下。

   LevelDb的MemTable提供了将KV数据写入,删除以及读取KV记录的操作接口,但是事实上Memtable并不存在真正的删除操作,删除某个Key的Value在Memtable内是作为插入一条记录实施的,但是会打上一个Key的删除标记,真正的删除操作是Lazy的,会在以后的Compaction过程中去掉这个KV。

   需要注意的是,LevelDb的Memtable中KV对是根据Key大小有序存储的,在系统插入新的KV时,LevelDb要把这个KV插到合适的位置上以保持这种Key有序性。其实,LevelDb的Memtable类只是一个接口类,真正的操作是通过背后的SkipList来做的,包括插入操作和读取操作等,所以Memtable的核心数据结构是一个SkipList。

   SkipList是由William Pugh发明。他在Communications of the ACM June 1990, 33(6) 668-676 发表了Skip lists: a probabilistic alternative to balanced trees,在该论文中详细解释了SkipList的数据结构和插入删除操作。

    SkipList是平衡树的一种替代数据结构,但是和红黑树不相同的是,SkipList对于树的平衡的实现是基于一种随机化的算法的,这样也就是说SkipList的插入和删除的工作是比较简单的。

    关于SkipList的详细介绍可以参考《levelDB源码分析-Skiplist》或 http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html,讲述的很清楚,LevelDb的SkipList基本上是一个具体实现,并无特殊之处。

  SkipList不仅是维护有序数据的一个简单实现,而且相比较平衡树来说,在插入数据的时候可以避免频繁的树节点调整操作,所以写入效率是很高的,LevelDb整体而言是个高写入系统,SkipList在其中应该也起到了很重要的作用。Redis为了加快插入操作,也使用了SkipList来作为内部实现数据结构。

  下面介绍MemTable的定义及介绍:

  MemTable是基于引用计数的,每次使用需要首先调用Ref(),使用完毕时调用Unref();

class MemTable {
    public:
        // MemTables are reference counted.  The initial reference count
        // is zero and the caller must call Ref() at least once.
        explicit MemTable(const InternalKeyComparator& comparator);					// 构造函数,传入比较器comparator(见下面介绍)

        // Increase reference count.
        void Ref() { ++refs_; }												// 增加引用计数

        // Drop reference count.  Delete if no more references exist.
        void Unref() {														// 减少引用计数
            --refs_;
            assert(refs_ >= 0);
            if (refs_ <= 0) {
                delete this;
            }
        }

        // Returns an estimate of the number of bytes of data in use by this
        // data structure.
        //
        // REQUIRES: external synchronization to prevent simultaneous
        // operations on the same MemTable.
        size_t ApproximateMemoryUsage();										// 返回使用容量

        // Return an iterator that yields the contents of the memtable.
        //
        // The caller must ensure that the underlying MemTable remains live
        // while the returned iterator is live.  The keys returned by this
        // iterator are internal keys encoded by AppendInternalKey in the
        // db/format.{h,cc} module.
        Iterator* NewIterator();												// 迭代器

        // Add an entry into memtable that maps key to value at the
        // specified sequence number and with the specified type.
        // Typically value will be empty if type==kTypeDeletion.
        void Add(SequenceNumber seq, ValueType type,								// 增加一个key/value到MemTable中
        		   const Slice& key,											// type=kTypeDeletion时,value为空
        		   const Slice& value);

        // If memtable contains a value for key, store it in *value and return true.
        // If memtable contains a deletion for key, store a NotFound() error
        // in *status and return true.
        // Else, return false.
        bool Get(const LookupKey& key, std::string* value, Status* s);				// 查询接口

    private:
        ~MemTable();  // Private since only Unref() should be used to delete it		// 采用引用计数,定义为private,外部不能删除

        struct KeyComparator {
            const InternalKeyComparator comparator;
            explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) { }
            int operator()(const char* a, const char* b) const;
            };
        friend class MemTableIterator;
        friend class MemTableBackwardIterator;

        typedef SkipList<const char*, KeyComparator> Table;	
        Table table_;														// Skiplist				

        KeyComparator comparator_;											// 比较器comparator_

        Arena arena_;														// 内存池
        
        int refs_;															// 引用计数

        // No copying allowed
        MemTable(const MemTable&);											// 拷贝构造函数及赋值操作不允许
        void operator=(const MemTable&);
        };
 
           
    // Modules in this directory should keep internal keys wrapped inside
    // the following class instead of plain strings so that we do not
    // incorrectly use string comparisons instead of an InternalKeyComparator.
    class InternalKey {														// 内部Key
      private:
        std::string rep_;
      public:
        InternalKey() { }   // Leave rep_ as empty to indicate it is invalid
        InternalKey(const Slice& user_key, SequenceNumber s, ValueType t) {
            AppendInternalKey(&rep_, ParsedInternalKey(user_key, s, t));
        }

        void DecodeFrom(const Slice& s) { rep_.assign(s.data(), s.size()); }			// 转化为string
        Slice Encode() const {												// 返回string
            assert(!rep_.empty());
            return rep_;
        }

        Slice user_key() const { return ExtractUserKey(rep_); }					// 返回User Key

        void SetFrom(const ParsedInternalKey& p) {
            rep_.clear();
            AppendInternalKey(&rep_, p);
        }

        void Clear() { rep_.clear(); }

        std::string DebugString() const;
    };
    
    struct ParsedInternalKey {												//  InternalKey格式:
        Slice user_key;             											//+   user key
        SequenceNumber sequence;    											//+   sequence
        ValueType type;             											//+   type: kTypeDeletion or kTypeValue

        ParsedInternalKey() { }  // Intentionally left uninitialized (for speed)
        ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)		// 构造函数
            : user_key(u), sequence(seq), type(t) { }
        std::string DebugString() const;
    }; 
    
    // Returns the user key portion of an internal key.
    inline Slice ExtractUserKey(const Slice& internal_key) {						// 返回User Key
        assert(internal_key.size() >= 8);
        return Slice(internal_key.data(), internal_key.size() - 8);
    }

    inline ValueType ExtractValueType(const Slice& internal_key) {					// 返回Type
        assert(internal_key.size() >= 8);
        const size_t n = internal_key.size();
        uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
        unsigned char c = num & 0xff;
        return static_cast<ValueType>(c);
    }
    void AppendInternalKey(std::string* result, const ParsedInternalKey& key) {		// 将InternalKey编译为string
        result->append(key.user_key.data(), key.user_key.size());
        PutFixed64(result, PackSequenceAndType(key.sequence, key.type));
    }
    static uint64_t PackSequenceAndType(uint64_t seq, ValueType t) {				// sequence | type
        assert(seq <= kMaxSequenceNumber);
        assert(t <= kValueTypeForSeek);
        return (seq << 8) | t;
    }
    
    // A comparator for internal keys that uses a specified comparator for
    // the user key portion and breaks ties by decreasing sequence number.
    class InternalKeyComparator : public Comparator {
        private:
        const Comparator* user_comparator_;
        public:
        explicit InternalKeyComparator(const Comparator* c) : user_comparator_(c) { }
        virtual const char* Name() const;
        virtual int Compare(const Slice& a, const Slice& b) const;
        virtual void FindShortestSeparator(std::string* start,
        								const Slice& limit) const;
        virtual void FindShortSuccessor(std::string* key) const;

        const Comparator* user_comparator() const { return user_comparator_; }

        int Compare(const InternalKey& a, const InternalKey& b) const;
   };


    inline int InternalKeyComparator::Compare(const InternalKey& a, const InternalKey& b) const {
        return Compare(a.Encode(), b.Encode());	// Encode返回string,默认采用string的字典序比较
    }
    int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
        // Order by:
        //    increasing user key (according to user-supplied comparator)
        //    decreasing sequence number
        //    decreasing type (though sequence# should be enough to disambiguate)
        int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
        if (r == 0) {
            const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
            const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
            if (anum > bnum) {
                r = -1;
                } else if (anum < bnum) {
                r = +1;
            }
        }
        return r;
    }


    inline bool ParseInternalKey(const Slice& internal_key,						// 解析InternalKey
    						    ParsedInternalKey* result) {				
        const size_t n = internal_key.size();
        if (n < 8) return false;
        uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
        unsigned char c = num & 0xff;											// type:1 byte
        result->sequence = num >> 8;											// 最后的8字节,最后1字节为type,其余为sequence
        result->type = static_cast<ValueType>(c);
        result->user_key = Slice(internal_key.data(), n - 8);						// User Key
        return (c <= static_cast<unsigned char>(kTypeValue));						// 返回值为:type是否合法
    }
   
 // A helper class useful for DBImpl::Get()
    class LookupKey {														// 查找LookupKey
        public:
        // Initialize *this for looking up user_key at a snapshot with
        // the specified sequence number.
        LookupKey(const Slice& user_key, SequenceNumber sequence);

        ~LookupKey();

        // Return a key suitable for lookup in a MemTable.
        Slice memtable_key() const { return Slice(start_, end_ - start_); }

        // Return an internal key (suitable for passing to an internal iterator)
        Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }

        // Return the user key
        Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }

        private:
        // We construct a char array of the form:
        //    klength  varint32               <-- start_
        //    userkey  char[klength]          <-- kstart_
        //    tag      uint64
        //                                    <-- end_
        // The array is a suitable MemTable key.
        // The suffix starting with "userkey" can be used as an InternalKey.
        const char* start_;
        const char* kstart_;
        const char* end_;
        char space_[200];      // Avoid allocation for short keys

        // No copying allowed
        LookupKey(const LookupKey&);
        void operator=(const LookupKey&);
        };

    inline LookupKey::~LookupKey() {
        if (start_ != space_) delete[] start_;
    }
    
    
  使用:
   	class MemTableInserter : public WriteBatch::Handler {				// MemTableInserter直接操作Memtable
            public:
            SequenceNumber sequence_;
            MemTable* mem_;

            virtual void Put(const Slice& key, const Slice& value) {		// 插入记录
                mem_->Add(sequence_, kTypeValue, key, value);
                sequence_++;
            }
            virtual void Delete(const Slice& key) {						// 删除记录
                mem_->Add(sequence_, kTypeDeletion, key, Slice());
                sequence_++;
            }
    };  
    Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) {	// 
        MemTableInserter inserter;
        inserter.sequence_ = WriteBatchInternal::Sequence(b);
        inserter.mem_ = memtable;
        return b->Iterate(&inserter);
    }
    
    
           

   总结:

        1、SkipList中保存的数据格式对应一个LookupKey+Value,其格式为:

            (string)"internel-key-size + internal-key-data + value-size + value-data",# LookupKey序列化 + Value

            即:(string)LookupKey + value-size+value-data            

        2、InternalKey数据格式为:(string)"user-key + sequence + type", 对应ParsedInternalKey,其中sequnce+type占用8bytes,type占用1字节

        ParsedInternalKey:

            user key

            sequence number

            type

        InternalKey:

            string(ParsedInternalKey)   # ParseInternalKey的序列化字符串

        LookupKey:

            internal-key-size + InternalKey         # InternalKey长度 + InternalKey

        3、疑问:Memtable的Add方法调用table_.Insert(buf)时,buf为InternalKey+Value,那么Skiplist中key的排序包括了value部分吗?

        因为Skiplist声明为模板类,并且传入了Comparator,这里传入的是InternalKeyComparator对象,Skiplist中比较时,执行的是InternalKeyComparator::Compare函数,里面调用ExtractUserKey抽取了user key,所以最终比较的还是User Key,value部分并没有参与比较。

        Skiplist中的Node中的数据为:InternalKey+Value-size+Value-data。

继续阅读