PostgreSQL Buffer Manager: Shared Buffer Lookup

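A lookup starts in the shared buffer table, keyed by BufferTag. If the table contains a matching entry, the requested file block is already in the buffer pool and its buffer is returned directly. If there is no entry, a free buffer must be found in the buffer pool to hold the block: if one is available it is used; otherwise the replacement mechanism evicts a buffer to obtain one. The file block is then read from disk into that buffer, and the buffer is recorded in the buffer table.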
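The flow above can be sketched as a toy, single-threaded buffer pool. Everything here is hypothetical and only illustrates the order of the steps: the `Toy*` names are invented, a linear scan stands in for the hashed buffer table, a round-robin counter stands in for the replacement mechanism, and `toy_disk_read` fakes the disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NBUFFERS 4
#define TOY_BLCKSZ   16

/* Hypothetical, simplified tag: a relation id plus a block number. */
typedef struct { unsigned rel; unsigned block; } ToyTag;

typedef struct {
    bool   in_use;              /* slot currently holds a block */
    ToyTag tag;                 /* which block it holds */
    char   data[TOY_BLCKSZ];    /* page contents */
} ToySlot;

static ToySlot toy_pool[TOY_NBUFFERS];
static int     toy_next_victim = 0;   /* round-robin stand-in for replacement */
static int     toy_disk_reads  = 0;   /* counts simulated disk reads */

/* Stand-in for reading a block from disk. */
static void toy_disk_read(ToyTag tag, char *dst)
{
    memset(dst, 'A' + (int) (tag.block % 26), TOY_BLCKSZ);
    toy_disk_reads++;
}

/* Returns the slot that holds the block; *hit reports whether it was cached. */
static int toy_read_buffer(ToyTag tag, bool *hit)
{
    int slot = -1;

    assert(hit != NULL);

    /* 1. Look the tag up; a hit returns the buffer directly. */
    for (int i = 0; i < TOY_NBUFFERS; i++) {
        if (toy_pool[i].in_use &&
            toy_pool[i].tag.rel == tag.rel &&
            toy_pool[i].tag.block == tag.block) {
            *hit = true;
            return i;
        }
    }
    *hit = false;

    /* 2. Prefer a free slot. */
    for (int i = 0; i < TOY_NBUFFERS; i++) {
        if (!toy_pool[i].in_use) { slot = i; break; }
    }

    /* 3. No free slot: evict one. */
    if (slot < 0) {
        slot = toy_next_victim;
        toy_next_victim = (toy_next_victim + 1) % TOY_NBUFFERS;
    }

    /* 4. Read the block in and record it under its tag. */
    toy_disk_read(tag, toy_pool[slot].data);
    toy_pool[slot].tag = tag;
    toy_pool[slot].in_use = true;
    return slot;
}
```

Calling `toy_read_buffer` twice with the same tag performs one simulated disk read; the second call is a hit on the same slot.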

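ReadBuffer_common is the common routine behind all buffer reads; it defines the read path shared by local buffers and shared buffers. Its parameters carry the three pieces of information that make up a BufferTag, as described earlier: the relation file's RelFileNode, the fork number (the file type), and the block number. The RelFileNode is not passed as a separate argument; it arrives as part of the relation's SMgrRelation descriptor.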

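The function reports through the boolean output parameter hit whether the block was already cached, so that no read was needed (*hit is set to true if the request was satisfied from shared buffer cache). The return value is the buffer id of the buffer that contains the requested file block.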

relpersistence: passed through to BufferAlloc(smgr, relpersistence, forkNum, blockNum, strategy, &found).

strategy: passed through to BufferAlloc(smgr, relpersistence, forkNum, blockNum, strategy, &found).

mode: the read mode. On a miss that is not a relation extension, the modes RBM_NORMAL, RBM_NORMAL_NO_LOG, and RBM_ZERO_ON_ERROR count the access as a block read.

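ReadBuffer_common first checks whether the blockNum passed in is P_NEW; if so, it calls smgrnblocks to substitute the proper block number (the current number of blocks in the fork). It then tests SmgrIsTemp(smgr) to decide whether the request is for a local buffer: if so it calls LocalBufferAlloc, otherwise it calls BufferAlloc to obtain the requested shared buffer. BufferAlloc is the core function that implements the buffer replacement strategy and is covered in detail in a later post.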

static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, BufferAccessStrategy strategy, bool *hit) {
  BufferDesc *bufHdr;
  Block    bufBlock;
  bool    found, isExtend, isLocalBuf = SmgrIsTemp(smgr);
  *hit = false;
  /* Make sure we will have room to remember the buffer pin */
  ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  isExtend = (blockNum == P_NEW);
  TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, smgr->smgr_rnode.node.relNode, smgr->smgr_rnode.backend, isExtend);
  /* Substitute proper block number if caller asked for P_NEW */
  if (isExtend) blockNum = smgrnblocks(smgr, forkNum);
  if (isLocalBuf) {
    bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
    if (found) pgBufferUsage.local_blks_hit++;
    else if (isExtend) pgBufferUsage.local_blks_written++;
    else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG || mode == RBM_ZERO_ON_ERROR) pgBufferUsage.local_blks_read++;
  } else {
    /* lookup the buffer.  IO_IN_PROGRESS is set if the requested block is not currently in memory. */
    bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum, strategy, &found);
    if (found) pgBufferUsage.shared_blks_hit++;
    else if (isExtend) pgBufferUsage.shared_blks_written++;
    else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG || mode == RBM_ZERO_ON_ERROR) pgBufferUsage.shared_blks_read++;
  }      
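The P_NEW substitution and the counter selection in the fragment above can be isolated into two small pure functions. This is only a sketch: `toy_resolve_block`, `toy_classify`, and `ToyAccessKind` are hypothetical names, P_NEW is assumed to be the all-ones block number, and the enum is a local copy of the read modes named in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define P_NEW ((BlockNumber) 0xFFFFFFFF)   /* assumed: the invalid block number */

typedef enum {
    RBM_NORMAL,
    RBM_ZERO_AND_LOCK,
    RBM_ZERO_AND_CLEANUP_LOCK,
    RBM_ZERO_ON_ERROR,
    RBM_NORMAL_NO_LOG
} ReadBufferMode;

/* Hypothetical labels for the pgBufferUsage counter that gets bumped. */
typedef enum { TOY_HIT, TOY_WRITTEN, TOY_READ, TOY_UNCOUNTED } ToyAccessKind;

/* P_NEW means "the block just past the current end of the fork". */
static BlockNumber toy_resolve_block(BlockNumber requested, BlockNumber nblocks)
{
    assert(nblocks != P_NEW);
    return (requested == P_NEW) ? nblocks : requested;
}

/* Mirrors the if / else-if chain that follows the allocation call. */
static ToyAccessKind toy_classify(bool found, bool isExtend, ReadBufferMode mode)
{
    if (found)
        return TOY_HIT;
    if (isExtend)
        return TOY_WRITTEN;
    if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG || mode == RBM_ZERO_ON_ERROR)
        return TOY_READ;
    return TOY_UNCOUNTED;   /* the zeroing modes do not read the block */
}
```

For example, a miss in RBM_ZERO_AND_LOCK mode falls through every branch and bumps no read counter, because the page will be zero-filled rather than read.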

If LocalBufferAlloc or BufferAlloc found the block and the blockNum passed in was not P_NEW, the function updates the statistics and returns the buffer id obtained from BufferDescriptorGetBuffer(bufHdr).

/* At this point we do NOT hold any locks. */
  /* if it was already in the buffer pool, we're done */
  if (found) {
    if (!isExtend) {
      /* Just need to update stats before we exit */
      *hit = true;
      VacuumPageHit++;
      if (VacuumCostActive) VacuumCostBalance += VacuumCostPageHit;
      TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, smgr->smgr_rnode.node.relNode, smgr->smgr_rnode.backend, isExtend, found);
      /* In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked on return. */
      if (!isLocalBuf) {
        if (mode == RBM_ZERO_AND_LOCK) LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
        else if (mode == RBM_ZERO_AND_CLEANUP_LOCK) LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
      }
      return BufferDescriptorGetBuffer(bufHdr);
    }      

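If LocalBufferAlloc or BufferAlloc found the block but the blockNum passed in was P_NEW, the following code handles the corner case in which we are extending the relation yet find a pre-existing buffer already marked BM_VALID.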

/*
     * We get here only in the corner case where we are trying to extend
     * the relation but we found a pre-existing buffer marked BM_VALID.
     * This can happen because mdread doesn't complain about reads beyond
     * EOF (when zero_damaged_pages is ON) and so a previous attempt to
     * read a block beyond EOF could have left a "valid" zero-filled
     * buffer.  Unfortunately, we have also seen this case occurring
     * because of buggy Linux kernels that sometimes return an
     * lseek(SEEK_END) result that doesn't account for a recent write. In
     * that situation, the pre-existing buffer would contain valid data
     * that we don't want to overwrite.  Since the legitimate case should
     * always have left a zero-filled buffer, complain if not PageIsNew.
     */
    bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
    if (!PageIsNew((Page) bufBlock))
      ereport(ERROR,(errmsg("unexpected data beyond EOF in block %u of relation %s",blockNum, relpath(smgr->smgr_rnode, forkNum)),errhint("This has been seen to occur with buggy kernels; consider updating your system.")));

    /*
     * We *must* do smgrextend before succeeding, else the page will not
     * be reserved by the kernel, and the next P_NEW call will decide to
     * return the same page.  Clear the BM_VALID bit, do the StartBufferIO
     * call that BufferAlloc didn't, and proceed.
     */
    if (isLocalBuf) {
      /* Only need to adjust flags */
      uint32    buf_state = pg_atomic_read_u32(&bufHdr->state);
      Assert(buf_state & BM_VALID);
      buf_state &= ~BM_VALID;
      pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
    } else {
      /*
       * Loop to handle the very small possibility that someone re-sets
       * BM_VALID between our clearing it and StartBufferIO inspecting
       * it.
       */
      do {
        uint32    buf_state = LockBufHdr(bufHdr);
        Assert(buf_state & BM_VALID);
        buf_state &= ~BM_VALID;
        UnlockBufHdr(bufHdr, buf_state);
      } while (!StartBufferIO(bufHdr, true));
    }
  }      

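At this point a buffer has been allocated for the page, but its contents are not yet valid and the page must be brought in. If the blockNum passed in was P_NEW, the buffer is zero-filled and smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false) is called. Otherwise smgrread reads the page in, unless the caller only wants a buffer allocated (RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK), in which case the buffer is simply zero-filled.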

/*
   * if we have gotten to this point, we have allocated a buffer for the
   * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
   * if it's a shared buffer.
   *
   * Note: if smgrextend fails, we will end up with a buffer that is
   * allocated but not marked BM_VALID.  P_NEW will still select the same
   * block number (because the relation didn't get any longer on disk) and
   * so future attempts to extend the relation will find the same buffer (if
   * it's not been recycled) but come right back here to try smgrextend
   * again.
   */
  Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));  /* spinlock not needed */
  bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
  if (isExtend) {
    /* new buffers are zero-filled */
    MemSet((char *) bufBlock, 0, BLCKSZ);
    /* don't set checksum for all-zero page */
    smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
    /*
     * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
     * although we're essentially performing a write. At least on linux
     * doing so defeats the 'delayed allocation' mechanism, leading to
     * increased file fragmentation.
     */
  } else {
    /*
     * Read in the page, unless the caller intends to overwrite it and
     * just wants us to allocate a buffer.
     */
    if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
      MemSet((char *) bufBlock, 0, BLCKSZ);
    else {
      instr_time  io_start, io_time;
      if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start);
      smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
      if (track_io_timing) {
        INSTR_TIME_SET_CURRENT(io_time);
        INSTR_TIME_SUBTRACT(io_time, io_start);
        pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
        INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
      }
      /* check for garbage data */
      if (!PageIsVerifiedExtended((Page) bufBlock, blockNum, PIV_LOG_WARNING | PIV_REPORT_STAT)) {
        if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages) {
          ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED), errmsg("invalid page in block %u of relation %s; zeroing out page", blockNum, relpath(smgr->smgr_rnode, forkNum))));
          MemSet((char *) bufBlock, 0, BLCKSZ);
        } else
          ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), errmsg("invalid page in block %u of relation %s", blockNum, relpath(smgr->smgr_rnode, forkNum))));
      }
    }
  }

  /*
   * In RBM_ZERO_AND_LOCK mode, grab the buffer content lock before marking
   * the page as valid, to make sure that no other backend sees the zeroed
   * page before the caller has had a chance to initialize it.
   *
   * Since no-one else can be looking at the page contents yet, there is no
   * difference between an exclusive lock and a cleanup-strength lock. (Note
   * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
   * they assert that the buffer is already valid.)
   */
  if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) && !isLocalBuf)
    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
  if (isLocalBuf) {
    /* Only need to adjust flags */
    uint32    buf_state = pg_atomic_read_u32(&bufHdr->state);
    buf_state |= BM_VALID;
    pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
  } else {
    /* Set BM_VALID, terminate IO, and wake up any waiters */
    TerminateBufferIO(bufHdr, false, BM_VALID);
  }
  VacuumPageMiss++;
  if (VacuumCostActive) VacuumCostBalance += VacuumCostPageMiss;
  TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, smgr->smgr_rnode.node.relNode, smgr->smgr_rnode.backend, isExtend, found);
  return BufferDescriptorGetBuffer(bufHdr);
}      
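The decisions made in this last part of ReadBuffer_common can be summarized as a small decision table. The sketch below is an illustration only: `toy_fill_action`, `toy_on_invalid_page`, and the `Toy*` enums are hypothetical names, and the read-mode enum is a local copy of the modes that appear in the listing.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    RBM_NORMAL,
    RBM_ZERO_AND_LOCK,
    RBM_ZERO_AND_CLEANUP_LOCK,
    RBM_ZERO_ON_ERROR,
    RBM_NORMAL_NO_LOG
} ReadBufferMode;

/* How the freshly allocated buffer gets its contents. */
typedef enum { TOY_FILL_EXTEND, TOY_FILL_ZERO, TOY_FILL_READ } ToyFill;

/* What happens when a page read from disk fails verification. */
typedef enum { TOY_PAGE_ZEROED, TOY_PAGE_ERROR } ToyInvalidAction;

static ToyFill toy_fill_action(bool isExtend, ReadBufferMode mode)
{
    assert(mode <= RBM_NORMAL_NO_LOG);
    if (isExtend)
        return TOY_FILL_EXTEND;   /* zero-fill, then smgrextend */
    if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
        return TOY_FILL_ZERO;     /* caller will overwrite the page */
    return TOY_FILL_READ;         /* smgrread, then verify the page */
}

static ToyInvalidAction toy_on_invalid_page(ReadBufferMode mode, bool zero_damaged_pages)
{
    if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
        return TOY_PAGE_ZEROED;   /* WARNING, page replaced by zeros */
    return TOY_PAGE_ERROR;        /* ERROR, the read fails */
}
```

Read this way, RBM_ZERO_ON_ERROR behaves like a per-call version of the zero_damaged_pages setting: either one turns an invalid page into a warning plus a zeroed page.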

8.4.1. Accessing a Page Stored in the Buffer Pool

First, the simplest case is described, i.e. the desired page is already stored in the buffer pool. In this case, the buffer manager performs the following steps:

(1) Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_C’) and compute the hash bucket slot, which contains the associated entry of the created buffer_tag, using the hash function.

(2) Acquire the BufMappingLock partition that covers the obtained hash bucket slot in shared mode (this lock will be released in step (5)).

(3) Look up the entry whose tag is ‘Tag_C’ and obtain the buffer_id from the entry. In this example, the buffer_id is 2.

(4) Pin the buffer descriptor for buffer_id 2, i.e. the refcount and usage_count of the descriptor are increased by 1 (Section 8.3.2 describes pinning).

(5) Release the BufMappingLock.

(6) Access the buffer pool slot with buffer_id 2.

Then, when reading rows from the page in the buffer pool slot, the PostgreSQL process acquires the shared content_lock of the corresponding buffer descriptor. Thus, buffer pool slots can be read by multiple processes simultaneously.

When inserting rows into the page (or updating or deleting rows in it), the PostgreSQL process acquires the exclusive content_lock of the corresponding buffer descriptor (note that the dirty bit of the page must be set to ‘1’).

After accessing the pages, the refcount values of the corresponding buffer descriptors are decreased by 1.
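Pinning and unpinning, as used in the steps above, can be sketched with two counters. This is a simplified, single-threaded illustration: `ToyBufferDesc`, `toy_pin`, and `toy_unpin` are hypothetical names, and the usage-count cap of 5 is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_USAGE_COUNT 5   /* assumed cap on usage_count */

typedef struct {
    uint32_t refcount;      /* number of processes currently using the buffer */
    uint32_t usage_count;   /* how often the buffer has been accessed recently */
} ToyBufferDesc;

/* Pin: the buffer cannot be chosen as a replacement victim while pinned. */
static void toy_pin(ToyBufferDesc *d)
{
    d->refcount++;
    if (d->usage_count < TOY_MAX_USAGE_COUNT)
        d->usage_count++;
}

/* Unpin: only refcount drops; usage_count keeps the access history. */
static void toy_unpin(ToyBufferDesc *d)
{
    assert(d->refcount > 0);
    d->refcount--;
}
```

After a pin followed by an unpin, refcount is back to 0 while usage_count stays at 1, which is the information a replacement algorithm can later use to judge how recently the page was touched.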

8.4.2. Loading a Page from Storage to Empty Slot

In this second case, assume that the desired page is not in the buffer pool and the freelist has free elements (empty descriptors). In this case, the buffer manager performs the following steps:

(1) Look up the buffer table (we assume it is not found).

  1. Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_E’) and compute the hash bucket slot.
  2. Acquire the BufMappingLock partition in shared mode.
  3. Look up the buffer table (not found according to the assumption).
  4. Release the BufMappingLock.

(2) Obtain the empty buffer descriptor from the freelist, and pin it. In this example, the buffer_id of the obtained descriptor is 4.

(3) Acquire the BufMappingLock partition in exclusive mode (this lock will be released in step (6)).

(4) Create a new data entry that comprises the buffer_tag ‘Tag_E’ and buffer_id 4; insert the created entry into the buffer table.

(5) Load the desired page data from storage to the buffer pool slot with buffer_id 4 as follows:

  1. Acquire the exclusive io_in_progress_lock of the corresponding descriptor.
  2. Set the io_in_progress bit of the corresponding descriptor to ‘1’ to prevent access by other processes.
  3. Load the desired page data from storage to the buffer pool slot.
  4. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘0’, and the valid bit is set to ‘1’.
  5. Release the io_in_progress_lock.

(6) Release the BufMappingLock.

(7) Access the buffer pool slot with buffer_id 4.
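Step (2), taking an empty descriptor off the freelist and pinning it, can be sketched as popping the head of an index-linked list. The names (`ToyPool`, `toy_get_free_and_pin`, the `TOY_*` constants) are hypothetical, and the sketch ignores the locking a shared freelist would need.

```c
#include <assert.h>

#define TOY_NBUFFERS     4
#define TOY_END_OF_LIST  (-1)   /* marks the tail of the freelist */
#define TOY_NOT_IN_LIST  (-2)   /* descriptor is not on the freelist */

typedef struct {
    int free_next;   /* index of the next free descriptor */
    int refcount;    /* pin count */
} ToyDesc;

typedef struct {
    ToyDesc desc[TOY_NBUFFERS];
    int     first_free;   /* head of the freelist */
} ToyPool;

/* Initially every descriptor is empty and chained into the freelist. */
static void toy_pool_init(ToyPool *p)
{
    for (int i = 0; i < TOY_NBUFFERS; i++) {
        p->desc[i].refcount = 0;
        p->desc[i].free_next = (i + 1 < TOY_NBUFFERS) ? i + 1 : TOY_END_OF_LIST;
    }
    p->first_free = 0;
}

/* Pop the head of the freelist and pin it; returns -1 if the list is empty. */
static int toy_get_free_and_pin(ToyPool *p)
{
    int id = p->first_free;

    if (id == TOY_END_OF_LIST)
        return -1;   /* caller has to pick a victim instead */

    assert(p->desc[id].refcount == 0);
    p->first_free = p->desc[id].free_next;
    p->desc[id].free_next = TOY_NOT_IN_LIST;
    p->desc[id].refcount = 1;
    return id;
}
```

Once the freelist is exhausted the function returns -1, which corresponds to the situation handled in the next subsection, where a victim slot has to be selected.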

8.4.3. Loading a Page from Storage to a Victim Buffer Pool Slot

In this case, assume that all buffer pool slots are occupied by pages but the desired page is not stored. The buffer manager performs the following steps:

(1) Create the buffer_tag of the desired page and look up the buffer table. In this example, we assume that the buffer_tag is ‘Tag_M’ (the desired page is not found).

(2) Select a victim buffer pool slot using the clock-sweep algorithm, obtain the old entry, which contains the buffer_id of the victim pool slot, from the buffer table and pin the victim pool slot in the buffer descriptors layer. In this example, the buffer_id of the victim slot is 5 and the old entry is ‘Tag_F, id=5’. The clock sweep is described in the next subsection.

(3) Flush (write and fsync) the victim page data if it is dirty; otherwise proceed to step (4).

The dirty page must be written to storage before overwriting with new data. Flushing a dirty page is performed as follows:

  1. Acquire the shared content_lock and the exclusive io_in_progress lock of the descriptor with buffer_id 5 (released in step 6).
  2. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘1’ and the just_dirtied bit is set to ‘0’.
  3. Depending on the situation, the XLogFlush() function is invoked to write WAL data on the WAL buffer to the current WAL segment file (details are omitted; WAL and the XLogFlush function are described in Chapter 9).
  4. Flush the victim page data to storage.
  5. Change the states of the corresponding descriptor; the io_in_progress bit is set to ‘0’ and the valid bit is set to ‘1’.
  6. Release the io_in_progress and content_lock locks.

(4) Acquire the old BufMappingLock partition that covers the slot that contains the old entry, in exclusive mode.

(5) Acquire the new BufMappingLock partition and insert the new entry into the buffer table:

  1. Create the new entry comprised of the new buffer_tag ‘Tag_M’ and the victim’s buffer_id.
  2. Acquire the new BufMappingLock partition that covers the slot containing the new entry in exclusive mode.
  3. Insert the new entry into the buffer table.

(6) Delete the old entry from the buffer table, and release the old BufMappingLock partition.

(7) Load the desired page data from the storage to the victim buffer slot. Then, update the flags of the descriptor with buffer_id 5; the dirty bit is set to ‘0’ and the other bits are initialized.

(8) Release the new BufMappingLock partition.
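Step (2) relies on clock-sweep to pick the victim. A minimal single-threaded sketch of that idea follows, assuming each descriptor carries a refcount and a usage_count: the hand skips pinned descriptors, decrements the usage_count of unpinned ones, and stops at the first unpinned descriptor whose usage_count is already 0. `ToyDesc` and `toy_clock_sweep` are hypothetical names, and the sketch leaves out locking, flushing, and the buffer table updates.

```c
#include <assert.h>

typedef struct {
    int refcount;      /* > 0 means pinned, never a victim */
    int usage_count;   /* decremented each time the hand passes */
} ToyDesc;

/*
 * Advance the clock hand until an unpinned descriptor with usage_count 0
 * is found. Returns its index, or -1 if every descriptor is pinned.
 */
static int toy_clock_sweep(ToyDesc *desc, int n, int *hand)
{
    int pinned_in_a_row = 0;

    assert(n > 0 && *hand >= 0 && *hand < n);

    while (pinned_in_a_row < n) {
        int id = *hand;

        *hand = (*hand + 1) % n;

        if (desc[id].refcount == 0) {
            if (desc[id].usage_count == 0)
                return id;            /* victim found */
            desc[id].usage_count--;   /* give it another lap */
            pinned_in_a_row = 0;
        } else {
            pinned_in_a_row++;        /* pinned: skip it */
        }
    }
    return -1;   /* a full lap of pinned buffers: nothing to evict */
}
```

With descriptors `{1,0}, {0,2}, {0,1}, {0,3}` (refcount, usage_count) and the hand at 0, the sweep skips the pinned slot 0, ages slots 1 to 3 on the first lap, and selects slot 2 on the second lap, because its usage_count reaches 0 first in hand order.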