PostgreSQL Buffer Manager: Overview

Buffer Manager Structure

The PostgreSQL buffer manager comprises three layers: the buffer table, the buffer descriptors, and the buffer pool.

Buffer Table Layer

The buffer table is a hash table that stores the relations between the buffer_tags of stored pages and the buffer_ids of the descriptors that hold the stored pages' respective metadata. A buffer_tag identifies a data page of a relation; a buffer_id is the index of the buffer descriptor that holds that page's metadata.

A buffer tag is like a courier tracking number, which is unique nationwide.

Every page in every PostgreSQL data file is assigned a unique tag, the buffer_tag. A buffer_tag comprises three values: the RelFileNode, the fork number, and the block number of the page within the fork. The fork number of a relation's main data is 0, that of its freespace map is 1, and that of its visibility map is 2. For example, the buffer_tag '{(16821, 16384, 37721), 0, 7}' identifies the page at block number 7 of the relation whose OID and fork number are 37721 and 0, respectively; the relation is contained in the database whose OID is 16384 under the tablespace whose OID is 16821. Similarly, the buffer_tag '{(16821, 16384, 37721), 1, 3}' identifies the page at block number 3 of the freespace map whose OID and fork number are 37721 and 1, respectively.

typedef struct buftag{
  RelFileNode rnode;      /* physical relation identifier */
  ForkNumber  forkNum;
  BlockNumber blockNum;    /* blknum relative to begin of reln */
} BufferTag;
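
To make the tag layout concrete, here is a minimal sketch that builds the two example tags from the text and compares them field by field. The Demo* types and demo_* functions are hypothetical stand-ins for RelFileNode, ForkNumber, and BufferTag, not the PostgreSQL definitions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumptions for illustration). */
typedef struct {
    uint32_t spcNode;   /* tablespace OID */
    uint32_t dbNode;    /* database OID */
    uint32_t relNode;   /* relation identifier */
} DemoRelFileNode;

typedef enum {
    DEMO_MAIN_FORK = 0, /* table or index data */
    DEMO_FSM_FORK  = 1, /* freespace map */
    DEMO_VM_FORK   = 2  /* visibility map */
} DemoForkNumber;

typedef struct {
    DemoRelFileNode rnode;
    DemoForkNumber  forkNum;
    uint32_t        blockNum;
} DemoBufferTag;

/* Build a tag from its three components. */
static DemoBufferTag demo_make_tag(uint32_t spc, uint32_t db, uint32_t rel,
                                   DemoForkNumber fork, uint32_t block)
{
    assert(fork >= DEMO_MAIN_FORK && fork <= DEMO_VM_FORK);
    DemoBufferTag tag = { { spc, db, rel }, fork, block };
    return tag;
}

/* Two tags name the same page only if every component matches. */
static bool demo_tags_equal(const DemoBufferTag *a, const DemoBufferTag *b)
{
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode  == b->rnode.dbNode  &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum       == b->forkNum       &&
           a->blockNum      == b->blockNum;
}
```

Comparing field by field, rather than with memcmp, keeps the sketch independent of any padding bytes inside the structure.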

The buffer table is like a parcel-tracking system: given a tracking number (the buffer_tag), it returns the pickup code (the buffer_id).

The buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries. A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page's metadata. For example, a data entry 'Tag_A, id=1' means that the buffer descriptor with buffer_id 1 stores the metadata of the page tagged with Tag_A. The data entry structure is as follows:

/* entry for buffer lookup hashtable */
typedef struct {
  BufferTag  key;      /* Tag of a disk page */
  int      id;        /* Associated buffer ID */
} BufferLookupEnt;

The built-in hash function maps buffer_tags to the hash bucket slots. Even though the number of hash bucket slots is greater than the number of buffer pool slots, collisions may occur. Therefore, the buffer table resolves collisions by separate chaining with linked lists: when data entries are mapped to the same bucket slot, they are stored in the same linked list, as shown in Fig. 8.4.
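
The collision handling described above can be sketched as a small chained hash table. This is only an illustration: DemoTag, the toy hash function, and the bucket count are assumptions, and the real buffer table is built on PostgreSQL's dynamic hash (dynahash) code instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NBUCKETS 8

/* Simplified tag: relation, fork, block (a stand-in for BufferTag). */
typedef struct { uint32_t rel, fork, block; } DemoTag;

/* One data entry: the tag, the buffer id, and the chain link. */
typedef struct DemoEntry {
    DemoTag           key;
    int               id;
    struct DemoEntry *next;
} DemoEntry;

typedef struct { DemoEntry *buckets[DEMO_NBUCKETS]; } DemoBufTable;

/* Toy hash function, chosen only to keep the example short. */
static uint32_t demo_hash(const DemoTag *t)
{
    return (t->rel * 31u + t->fork) * 31u + t->block;
}

/* Push a caller-owned entry onto the front of its bucket's chain. */
static void demo_insert(DemoBufTable *tab, DemoEntry *e)
{
    assert(tab != NULL && e != NULL);
    uint32_t b = demo_hash(&e->key) % DEMO_NBUCKETS;
    e->next = tab->buckets[b];
    tab->buckets[b] = e;
}

/* Walk the chain of the tag's bucket; return the buffer id, or -1 on a miss. */
static int demo_lookup(const DemoBufTable *tab, const DemoTag *t)
{
    for (DemoEntry *e = tab->buckets[demo_hash(t) % DEMO_NBUCKETS]; e != NULL; e = e->next) {
        if (e->key.rel == t->rel && e->key.fork == t->fork && e->key.block == t->block)
            return e->id;
    }
    return -1;
}
```

With this toy hash, the tags {1, 0, 0} and {1, 0, 8} fall into the same bucket, so both lookups have to walk a shared chain and compare full tags to find the right entry.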

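The code that creates the buffer table is shown below. For how the hash table itself is created, see the companion posts on creating dynamic hash tables (PostgreSQL data structures) and on the ShmemInitHash function called from InitShmemIndex (PostgreSQL shared memory).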

void InitBufTable(int size) {
  HASHCTL    info;
  /* BufferTag maps to Buffer */
  info.keysize = sizeof(BufferTag);
  info.entrysize = sizeof(BufferLookupEnt);
  info.num_partitions = NUM_BUFFER_PARTITIONS;
  SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table", size, size, &info, HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
}

The buffer table workflow is as follows:

(1) When reading a table or index page, a backend process sends a request that includes the page's buffer_tag to the buffer manager.

(2) The buffer manager returns the buffer_ID of the slot that stores the requested page. If the requested page is not stored in the buffer pool, the buffer manager loads the page from persistent storage into one of the buffer pool slots and then returns that slot's buffer_ID.

(3) The backend process accesses the slot identified by the buffer_ID (to read the desired page).
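
The hit and miss paths of this workflow can be sketched as follows. Everything here is a simplification under stated assumptions: the lookup is a linear scan rather than a hash table, demo_read_from_storage is a hypothetical stand-in for real disk I/O, and there is no page replacement when the pool is full.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NBUFFERS  4
#define DEMO_PAGE_SIZE 16

typedef struct { int rel; int block; } DemoTag;   /* simplified buffer tag */

static DemoTag demo_tags[DEMO_NBUFFERS];                  /* tag held by each slot */
static char    demo_pool[DEMO_NBUFFERS][DEMO_PAGE_SIZE];  /* the buffer pool */
static int     demo_used = 0;                             /* slots filled so far */
static int     demo_disk_reads = 0;                       /* counts simulated I/O */

/* Hypothetical storage read: fills the page with a recognizable pattern. */
static void demo_read_from_storage(DemoTag tag, char *page)
{
    memset(page, 0, DEMO_PAGE_SIZE);
    page[0] = (char) tag.block;
    demo_disk_reads++;
}

/* Steps (1) and (2): take a tag, return the buffer id of the slot holding the page. */
static int demo_read_buffer(DemoTag tag)
{
    for (int id = 0; id < demo_used; id++) {
        if (demo_tags[id].rel == tag.rel && demo_tags[id].block == tag.block)
            return id;                      /* hit: page is already in the pool */
    }
    assert(demo_used < DEMO_NBUFFERS);      /* this sketch has no replacement policy */
    int id = demo_used++;
    demo_tags[id] = tag;
    demo_read_from_storage(tag, demo_pool[id]);   /* miss: load the page */
    return id;
}
```

Step (3) is then just an access to demo_pool[id]; requesting the same tag twice returns the same id and triggers only one simulated read.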

When a backend process modifies a page in the buffer pool (e.g., by inserting tuples), the modified page, which has not yet been flushed to storage, is referred to as a dirty page.

Buffer Descriptor Layer

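The buffer_id is like the pickup code, and the buffer descriptor is like the storage record that the pickup code retrieves. Picture a large parcel pickup station where robots move parcels to and from the shelves, so you must first look up the storage record at the station. A buffer pool slot is the shelf, and the storage record tells you which shelf holds the parcel.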

Buffer Descriptor

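Each buffer descriptor holds the metadata of the page stored in the corresponding buffer pool slot: the page's tag, the descriptor's own buffer id, the refcount (pin count), the usage_count, the page's state flags, and the light-weight locks that protect the page. In the version of the structure shown here, the flags, refcount, and usage_count are packed into the single atomic state field, and the io_in_progress lock is not a member of the structure; it lives in the separate BufferIOLWLockArray that InitBufferPool allocates.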

typedef struct BufferDesc {
  BufferTag  tag;      /* ID of page contained in buffer */
  int      buf_id;      /* buffer's index number (from 0) */
  /* state of the tag, containing flags, refcount and usagecount */
  pg_atomic_uint32 state;
  int      wait_backend_pid;  /* backend PID of pin-count waiter */
  int      freeNext;    /* link in freelist chain */
  LWLock    content_lock;  /* to lock access to buffer contents */
} BufferDesc;

The buffer descriptor structure is defined by the structure BufferDesc. It has many fields; the main ones are described below. Note that refcount, usage_count, and the flags are logical fields that this version stores together in state.

tag holds the buffer_tag of the page stored in the corresponding buffer pool slot (buffer tag is defined in Section 8.1.2).
buf_id identifies the descriptor (it is equivalent to the buffer_id of the corresponding buffer pool slot).
refcount holds the number of PostgreSQL processes currently accessing the associated stored page. It is also referred to as the pin count. When a PostgreSQL process accesses the stored page, the refcount must be incremented by 1 (refcount++). After accessing the page, the refcount must be decremented by 1 (refcount--). When the refcount is zero, i.e. the associated stored page is not currently being accessed, the page is unpinned; otherwise it is pinned.
usage_count holds the number of times the associated stored page has been accessed since it was loaded into the corresponding buffer pool slot. usage_count is used in the page replacement algorithm (Section 8.4.4).
content_lock and io_in_progress_lock are light-weight locks that are used to control access to the associated stored page. These fields are described in Section 8.3.2.
flags can hold several states of the associated stored page. The main states are as follows:

dirty bit indicates whether the stored page is dirty.

valid bit indicates whether the stored page can be read or written. If this bit is set, the corresponding buffer pool slot stores a page and this descriptor holds the page's metadata, so the stored page can be read or written. If this bit is not set, the descriptor does not hold any metadata; either the slot holds no readable page or the buffer manager is replacing the stored page.

io_in_progress bit indicates whether the buffer manager is reading the associated page from storage or writing it to storage. In other words, this bit indicates whether a single process holds the io_in_progress_lock of this descriptor.

freeNext is the index of the next descriptor in the freelist, which is described below.
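
Since the structure shown above keeps the flags, refcount, and usage count in one 32-bit state word, a small sketch may help show how pinning and unpinning work on such a word. The bit layout and the DEMO_* names below are assumptions for illustration, not the definitions from the PostgreSQL headers, and the sketch uses plain arithmetic where the real code would use atomic operations on pg_atomic_uint32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 18 bits = refcount, next 4 bits = usage count, flags above. */
#define DEMO_REFCOUNT_ONE     1u
#define DEMO_REFCOUNT_MASK    ((1u << 18) - 1)
#define DEMO_USAGECOUNT_SHIFT 18
#define DEMO_USAGECOUNT_ONE   (1u << DEMO_USAGECOUNT_SHIFT)
#define DEMO_USAGECOUNT_MASK  (0xFu << DEMO_USAGECOUNT_SHIFT)
#define DEMO_MAX_USAGE_COUNT  5u          /* cap value is an assumption */
#define DEMO_FLAG_DIRTY       (1u << 23)
#define DEMO_FLAG_VALID       (1u << 24)

static uint32_t demo_refcount(uint32_t state)
{
    return state & DEMO_REFCOUNT_MASK;
}

static uint32_t demo_usagecount(uint32_t state)
{
    return (state & DEMO_USAGECOUNT_MASK) >> DEMO_USAGECOUNT_SHIFT;
}

/* Pin: refcount++, and bump the usage count unless it has reached the cap. */
static uint32_t demo_pin(uint32_t state)
{
    assert(demo_refcount(state) < DEMO_REFCOUNT_MASK);
    state += DEMO_REFCOUNT_ONE;
    if (demo_usagecount(state) < DEMO_MAX_USAGE_COUNT)
        state += DEMO_USAGECOUNT_ONE;
    return state;
}

/* Unpin: refcount--; the usage count and flags are left untouched. */
static uint32_t demo_unpin(uint32_t state)
{
    assert(demo_refcount(state) > 0);
    return state - DEMO_REFCOUNT_ONE;
}

/* A page is pinned exactly when its refcount is non-zero. */
static bool demo_is_pinned(uint32_t state)
{
    return demo_refcount(state) > 0;
}
```

Packing the three fields into one word lets a single atomic update change the pin count and inspect the flags together, which is the apparent motivation for the state field in the structure above.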

The buffer descriptors layer is an array of buffer descriptors. Each descriptor has a one-to-one correspondence with a buffer pool slot and holds the metadata of the page stored in that slot. When the PostgreSQL server starts, all buffer descriptors are empty, and they are linked through their freeNext members into a linked list called the freelist (Fig. 8.5).

The following steps, which load the first page, show how the descriptor layer works together with the buffer pool slots:

(1) Retrieve an empty descriptor from the top of the freelist, and pin it (i.e. increase its refcount and usage_count by 1).

(2) Insert a new entry, which holds the relation between the tag of the first page and the buffer_id of the retrieved descriptor, into the buffer table.

(3) Load the new page from storage into the corresponding buffer pool slot.

(4) Save the metadata of the new page to the retrieved descriptor.
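
Step (1) above can be sketched as popping the head of an index-linked freelist. The DemoDesc structure, the sentinel values, and the function names are hypothetical; steps (2) to (4) are reduced to recording a block number in the descriptor, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NBUFFERS             4
#define DEMO_FREENEXT_END_OF_LIST (-1)
#define DEMO_FREENEXT_NOT_IN_LIST (-2)

typedef struct {
    int buf_id;
    int refcount;
    int usage_count;
    int freeNext;    /* index of the next free descriptor */
    int tag_block;   /* simplified tag: a block number, -1 when empty */
} DemoDesc;

static DemoDesc demo_descs[DEMO_NBUFFERS];
static int      demo_first_free;

/* At startup every descriptor is empty and linked into the freelist. */
static void demo_init(void)
{
    for (int i = 0; i < DEMO_NBUFFERS; i++)
        demo_descs[i] = (DemoDesc){ i, 0, 0, i + 1, -1 };
    demo_descs[DEMO_NBUFFERS - 1].freeNext = DEMO_FREENEXT_END_OF_LIST;
    demo_first_free = 0;
}

/* Step (1): take the descriptor at the top of the freelist and pin it. */
static DemoDesc *demo_get_free(void)
{
    if (demo_first_free == DEMO_FREENEXT_END_OF_LIST)
        return NULL;                       /* freelist is exhausted */
    DemoDesc *d = &demo_descs[demo_first_free];
    assert(d->refcount == 0);
    demo_first_free = d->freeNext;
    d->freeNext = DEMO_FREENEXT_NOT_IN_LIST;
    d->refcount++;
    d->usage_count++;
    return d;
}

/* Steps (2) to (4), reduced to saving the page's tag in the descriptor. */
static int demo_load_page(int block)
{
    DemoDesc *d = demo_get_free();
    if (d == NULL)
        return -1;
    d->tag_block = block;
    return d->buf_id;
}
```

When the freelist is exhausted, this sketch simply returns -1; choosing a victim page to evict is the job of the page replacement algorithm, which is outside this example.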

Descriptors that have been retrieved from the freelist always hold a page's metadata. In other words, non-empty descriptors that continue to be used do not return to the freelist. However, the related descriptors are added to the freelist again, and their state becomes 'empty', when one of the following occurs:

Tables or indexes have been dropped.
Databases have been dropped.
Tables or indexes have been cleaned up using the VACUUM FULL command.
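
As a rough illustration of descriptors returning to the freelist, the sketch below empties every descriptor that belongs to a dropped relation and pushes it back onto the list. The DemoDesc layout and demo_drop_relation_buffers are hypothetical, and the sketch ignores pins, dirty pages, and locking, all of which a real implementation has to handle.

```c
#include <assert.h>

#define DEMO_NBUFFERS             3
#define DEMO_FREENEXT_END_OF_LIST (-1)
#define DEMO_FREENEXT_NOT_IN_LIST (-2)

typedef struct {
    int rel;        /* relation owning the stored page, 0 when empty */
    int freeNext;   /* freelist link, or DEMO_FREENEXT_NOT_IN_LIST */
} DemoDesc;

static DemoDesc demo_descs[DEMO_NBUFFERS];
static int      demo_first_free = DEMO_FREENEXT_END_OF_LIST;

/*
 * Mark every descriptor holding a page of 'rel' as empty and push it
 * onto the head of the freelist. Returns the number of descriptors released.
 */
static int demo_drop_relation_buffers(int rel)
{
    assert(rel != 0);
    int released = 0;
    for (int i = 0; i < DEMO_NBUFFERS; i++) {
        DemoDesc *d = &demo_descs[i];
        if (d->rel != rel || d->freeNext != DEMO_FREENEXT_NOT_IN_LIST)
            continue;               /* other relation, or already free */
        d->rel = 0;                 /* descriptor becomes empty */
        d->freeNext = demo_first_free;
        demo_first_free = i;
        released++;
    }
    return released;
}
```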

Buffer Pool Layer

The buffer pool layer stores data file pages, such as tables and indexes, as well as freespace maps and visibility maps. The buffer pool is an array in which each slot stores one page of a data file. The indices of the buffer pool array are referred to as buffer_ids.
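
Because the pool is one contiguous array of fixed-size pages, locating a slot from its buffer_id is plain pointer arithmetic. A minimal sketch, assuming an 8192-byte page and using hypothetical demo_* names in place of BufferBlocks and BLCKSZ:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLCKSZ   8192   /* assumed page size in bytes */
#define DEMO_NBUFFERS 4

/* The whole pool as one contiguous block of memory. */
static char demo_buffer_blocks[DEMO_NBUFFERS * DEMO_BLCKSZ];

/* Return the address of the page slot identified by buf_id (counted from 0). */
static char *demo_get_block(int buf_id)
{
    assert(buf_id >= 0 && buf_id < DEMO_NBUFFERS);
    return demo_buffer_blocks + (size_t) buf_id * DEMO_BLCKSZ;
}
```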

Buffer Initialization

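The InitBufferPool function initializes the shared buffer pool. It is called only once, during shared memory initialization (by the postmaster or a standalone backend). Initializing the shared buffers essentially means initializing two global variables: BufferDescriptors, the array that stores the descriptors of all shared buffers, and BufferBlocks, a pointer to the start of the shared buffer pool. NBuffers is defined with an initial value of 1000. From InitBufferPool we can work out how much shared memory each part of the buffer manager occupies: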

Component	Size
Buffer Descriptor	NBuffers * sizeof(BufferDescPadded)
Buffer Pool	NBuffers * (Size) BLCKSZ
Buffer IO Lock	NBuffers * (Size) sizeof(LWLockMinimallyPadded)
Checkpoint BufferIds	NBuffers * sizeof(CkptSortItem)
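
To put numbers on the table, the sketch below evaluates the four formulas for caller-supplied element sizes. The function and type names are hypothetical. The real element sizes depend on the platform and the PostgreSQL version, so the values used in the usage note are assumptions rather than measured figures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t descriptors;   /* NBuffers * sizeof(BufferDescPadded) */
    size_t pool;          /* NBuffers * BLCKSZ */
    size_t io_locks;      /* NBuffers * sizeof(LWLockMinimallyPadded) */
    size_t ckpt_ids;      /* NBuffers * sizeof(CkptSortItem) */
} DemoShmemSizes;

/* Evaluate the size formulas from the table for the given element sizes. */
static DemoShmemSizes demo_buffer_shmem_sizes(size_t nbuffers, size_t blcksz,
                                              size_t desc_size, size_t lwlock_size,
                                              size_t ckpt_item_size)
{
    assert(nbuffers > 0 && blcksz > 0);
    DemoShmemSizes s;
    s.descriptors = nbuffers * desc_size;
    s.pool        = nbuffers * blcksz;
    s.io_locks    = nbuffers * lwlock_size;
    s.ckpt_ids    = nbuffers * ckpt_item_size;
    return s;
}

/* Sum of the four components. */
static size_t demo_total_size(const DemoShmemSizes *s)
{
    return s->descriptors + s->pool + s->io_locks + s->ckpt_ids;
}
```

For example, with NBuffers = 1000, an 8192-byte block, and assumed element sizes of 64, 32, and 20 bytes for the descriptor, the lock, and the checkpoint sort item, the pool takes 8,192,000 bytes and the other three components together add only 116,000 bytes.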

void InitBufferPool(void) {
  bool    foundBufs, foundDescs, foundIOLocks, foundBufCkpt;
  /* Align descriptors to a cacheline boundary. */
  BufferDescriptors = (BufferDescPadded *)ShmemInitStruct("Buffer Descriptors", NBuffers * sizeof(BufferDescPadded), &foundDescs);
  BufferBlocks = (char *)ShmemInitStruct("Buffer Blocks", NBuffers * (Size) BLCKSZ, &foundBufs);
  /* Align lwlocks to cacheline boundary */
  BufferIOLWLockArray = (LWLockMinimallyPadded *)ShmemInitStruct("Buffer IO Locks", NBuffers * (Size) sizeof(LWLockMinimallyPadded), &foundIOLocks);
  LWLockRegisterTranche(LWTRANCHE_BUFFER_IO_IN_PROGRESS, "buffer_io");
  LWLockRegisterTranche(LWTRANCHE_BUFFER_CONTENT, "buffer_content");

  /* The array used to sort to-be-checkpointed buffer ids is located in shared memory, to avoid having to allocate significant amounts of memory at runtime. As that'd be in the middle of a checkpoint, or when the checkpointer is restarted, memory allocation failures would be painful. */
  CkptBufferIds = (CkptSortItem *)ShmemInitStruct("Checkpoint BufferIds", NBuffers * sizeof(CkptSortItem), &foundBufCkpt);

  if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt) {
    /* should find all of these, or none of them */
    Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
    /* note: this path is only taken in EXEC_BACKEND case */
  } else {
    int      i;

    /* Initialize all the buffer headers, together with each descriptor's content lock and I/O-in-progress lock. */
    for (i = 0; i < NBuffers; i++) {
      BufferDesc *buf = GetBufferDescriptor(i);
      CLEAR_BUFFERTAG(buf->tag);
      pg_atomic_init_u32(&buf->state, 0);
      buf->wait_backend_pid = 0;
      buf->buf_id = i;
      /* Initially link all the buffers together as unused. Subsequent management of this list is done by freelist.c. */
      buf->freeNext = i + 1;
      LWLockInitialize(BufferDescriptorGetContentLock(buf), LWTRANCHE_BUFFER_CONTENT);
      LWLockInitialize(BufferDescriptorGetIOLock(buf), LWTRANCHE_BUFFER_IO_IN_PROGRESS);
    }
    /* Correct last entry of linked list */
    GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
  }

  /* Init other shared buffer-management stuff */
  StrategyInitialize(!foundDescs);
  /* Initialize per-backend file flush context */
  WritebackContextInit(&BackendWritebackContext, &backend_flush_after);
}
