MySQL · Engine Features · IO_CACHE Source Code Analysis

Overview

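The importance of IO in a database goes without saying. To manage IO better, most databases handle page data and dirty-page flushing themselves (for example, the Buffer pool in InnoDB) rather than leaving it to the file system or to operating system scheduling. For sequentially written log data, however, the file system interface is far more convenient. The file system also manages data in pages and presents the application layer with a contiguous writable space. Its management unit, called a Sector here, is 4KB in size, so reads and writes at 4KB-aligned addresses avoid spanning multiple Sectors, which improves file system performance considerably. The job of IO_CACHE in MySQL is to buffer sequential file reads and writes and turn them into 4K-aligned file reads and writes.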

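As the figure shows, a file read or write smaller than the IO_CACHE is placed in the buffer, and once the IO_CACHE is full a single 4KB-aligned write is performed. If a single read or write exceeds the size of the IO_CACHE, the 4K-aligned part of the data is transferred in one operation and the remainder is placed in the IO_CACHE, to be merged with the next read or write.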
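The splitting of a large request can be sketched in a few lines of C. This only illustrates the arithmetic and is not MySQL code: BLOCK_SIZE, struct split, and split_request are hypothetical names, and the sketch assumes the request starts at a 4K-aligned file offset.

```c
#include <stddef.h>

#define BLOCK_SIZE 4096u  /* assumed 4KB alignment unit */

/* Hypothetical result type: how a large request is divided. */
struct split {
    size_t direct;    /* whole 4K blocks transferred to or from the file at once */
    size_t buffered;  /* tail left in the cache for the next request */
};

/* Split a request of `count` bytes that starts on a 4K boundary. */
static struct split split_request(size_t count)
{
    struct split s;
    s.direct = count & ~(size_t)(BLOCK_SIZE - 1);  /* round down to a multiple of 4K */
    s.buffered = count - s.direct;                 /* whatever does not fill a block */
    return s;
}
```

Under these assumptions, a request shorter than 4KB is buffered entirely, while a longer one sends only whole blocks to the file and keeps the tail in the cache.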

Source Code Analysis

IO_CACHE comes in several types, defined in cache_type:

enum cache_type
{
  TYPE_NOT_SET= 0, READ_CACHE, WRITE_CACHE,
  SEQ_READ_APPEND		/* sequential read or append */,
  READ_FIFO, READ_NET,WRITE_NET};

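The commonly used general log, slow log, error log, and binlog mainly use the READ_CACHE, WRITE_CACHE, and SEQ_READ_APPEND types, which are the ones this article covers. IO_CACHE also provides an interface that supports AIO and allows multiple threads to access one IO_CACHE concurrently, but these features currently see little use and are not covered here.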

The main code is in mysys/mf_iocache.c.

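READ_CACHE is a read buffer, WRITE_CACHE is a write buffer, and SEQ_READ_APPEND supports both reading and writing: a writer thread keeps appending data to the end of the file while a reader thread reads it. Appends use IO_CACHE::write_buffer and reads use IO_CACHE::buffer. When a read reaches data that is still in write_buffer, it is served from write_buffer. The SEQ_READ_APPEND type is used by the MySQL replication module, where the IO thread appends data to the relay log and the SQL thread reads it back and applies it. (Consider why the writer threads and the Dump thread on the master do not use this approach and rely on plain read-write instead: the order_commit function on the master can easily become a performance bottleneck, and having it compete with the Dump thread for append_buffer_lock does not look like a good idea.) Because SEQ_READ_APPEND is the most representative type, the rest of this article uses it as the example.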

Basic Data Structures

The basic structure is IO_CACHE. The comments in the code are fairly clear, so it is pasted here for easy reference later:

typedef struct st_io_cache
{
  /* Offset in file corresponding to the first byte of uchar* buffer. */
  my_off_t pos_in_file;
  /*
    The offset of end of file for READ_CACHE and WRITE_CACHE.
    For SEQ_READ_APPEND it is the maximum of the actual end of file and
    the position represented by read_end.
  */
  my_off_t end_of_file;
  /* Points to current read position in the buffer */
  uchar	*read_pos;
  /* the non-inclusive boundary in the buffer for the currently valid read */
  uchar  *read_end;
  uchar  *buffer;				/* The read buffer */
  /* Used in ASYNC_IO */
  uchar  *request_pos;

  /* Only used in WRITE caches and in SEQ_READ_APPEND to buffer writes */
  uchar  *write_buffer;
  /*
    Only used in SEQ_READ_APPEND, and points to the current read position
    in the write buffer. Note that reads in SEQ_READ_APPEND caches can
    happen from both read buffer (uchar* buffer) and write buffer
    (uchar* write_buffer).
  */
  uchar *append_read_pos;
  /* Points to current write position in the write buffer */
  uchar *write_pos;
  /* The non-inclusive boundary of the valid write area */
  uchar *write_end;

  /*
    Current_pos and current_end are convenience variables used by
    my_b_tell() and other routines that need to know the current offset
    current_pos points to &write_pos, and current_end to &write_end in a
    WRITE_CACHE, and &read_pos and &read_end respectively otherwise
  */
  uchar  **current_pos, **current_end;

  /*
    The lock is for append buffer used in SEQ_READ_APPEND cache
    need mutex copying from append buffer to read buffer.
  */
  mysql_mutex_t append_buffer_lock;

  /*
    A caller will use my_b_read() macro to read from the cache
    if the data is already in cache, it will be simply copied with
    memcpy() and internal variables will be accordingly updated with
    no functions invoked. However, if the data is not fully in the cache,
    my_b_read() will call read_function to fetch the data. read_function
    must never be invoked directly.
  */
  int (*read_function)(struct st_io_cache *,uchar *,size_t);
  /*
    Same idea as in the case of read_function, except my_b_write() needs to
    be replaced with my_b_append() for a SEQ_READ_APPEND cache
  */
  int (*write_function)(struct st_io_cache *,const uchar *,size_t);
  /*
    Specifies the type of the cache. 
  */
  enum cache_type type;
  /*
    Callbacks when the actual read I/O happens. These were added and
    are currently used for binary logging of LOAD DATA INFILE - when a
    block is read from the file, we create a block create/append event, and
    when IO_CACHE is closed, we create an end event. These functions could,
    of course be used for other things
  */
  IO_CACHE_CALLBACK pre_read;
  IO_CACHE_CALLBACK post_read;
  IO_CACHE_CALLBACK pre_close;
  /*
    Counts the number of times, when we were forced to use disk. We use it to
    increase the binlog_cache_disk_use and binlog_stmt_cache_disk_use status
    variables.
  */
  ulong disk_writes;
  void* arg;				/* for use by pre/post_read */
  char *file_name;			/* if used with 'open_cached_file' */
  char *dir,*prefix;
  File file; /* file descriptor */
  /*
    seek_not_done is set by my_b_seek() to inform the upcoming read/write
    operation that a seek needs to be performed prior to the actual I/O
    error is 0 if the cache operation was successful, -1 if there was a
    "hard" error, and the actual number of I/O-ed bytes if the read/write was
    partial.
  */
  int	seek_not_done,error;
  /* buffer_length is memory size allocated for buffer or write_buffer */
  size_t	buffer_length;
  /* read_length is the same as buffer_length except when we use async io */
  size_t  read_length;
  myf	myflags;			/* Flags used to my_read/my_write */
  /*
    alloced_buffer is 1 if the buffer was allocated by init_io_cache() and
    0 if it was supplied by the user.
    Currently READ_NET is the only one that will use a buffer allocated
    somewhere else
  */
  my_bool alloced_buffer;
} IO_CACHE;

Initialization

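The initialization function is init_io_cache, which mainly does the following:

Binds the cache to the corresponding file descriptor and initializes the various fields of IO_CACHE.
Allocates space for write_buffer and the read buffer.
Initializes the mutex append_buffer_lock (for the SEQ_READ_APPEND type).
Calls init_functions to set up the corresponding file read and write functions.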

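The buffer space is allocated according to the cachesize argument. The size passed in is usually not large; for example, the Binlog IO_CACHE is initialized with IO_SIZE (4KB). The file system has its own page cache, and data is only guaranteed to reach disk once fsync is called, so IO_CACHE has no need to buffer much data. Its only job is to make writes aligned. The space allocated is not necessarily the size that was passed in, though. Look at the code: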

min_cache= use_async_io ? IO_SIZE*4 : IO_SIZE*2;

cachesize= ((cachesize + min_cache-1) & ~(min_cache-1));
for (;;)
{
  if (cachesize < min_cache)
    cachesize= min_cache;
  buffer_block= cachesize;
  if (type == SEQ_READ_APPEND)
    buffer_block*= 2;

  if ((info->buffer= (uchar*) my_malloc(buffer_block, flags)) != 0)
  {
    info->write_buffer= info->buffer;
    if (type == SEQ_READ_APPEND)
      info->write_buffer= info->buffer + cachesize;
    info->alloced_buffer= 1;
    break;					/* Enough memory found */
  }
  if (cachesize == min_cache)
    DBUG_RETURN(2);				/* Can't alloc cache */
  /* Try with less memory */
  cachesize= (cachesize*3/4 & ~(min_cache-1));
}

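Without AIO, the minimum allocation is 8K, which matters later on. The SEQ_READ_APPEND type allocates twice the space because it has both a read buffer and a write buffer. If the requested space cannot be allocated, the function retries with a smaller size.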
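The size arithmetic of that loop can be tried in isolation. The sketch below copies only the two bit-mask expressions from the excerpt; round_cache_size and shrink_cache_size are hypothetical helper names, IO_SIZE is assumed to be 4096 as stated in the text, and min_cache is assumed to be a power of two.

```c
#include <stddef.h>

#define IO_SIZE 4096u  /* 4KB, the value the article quotes for IO_SIZE */

/* Round the requested size up to a multiple of min_cache (a power of two),
   never going below min_cache itself. */
static size_t round_cache_size(size_t cachesize, size_t min_cache)
{
    cachesize = (cachesize + min_cache - 1) & ~(min_cache - 1);
    if (cachesize < min_cache)
        cachesize = min_cache;
    return cachesize;
}

/* Size tried after a failed allocation: three quarters of the previous size,
   rounded down to a multiple of min_cache. */
static size_t shrink_cache_size(size_t cachesize, size_t min_cache)
{
    return (cachesize * 3 / 4) & ~(min_cache - 1);
}
```

With min_cache set to IO_SIZE*2, a request of IO_SIZE bytes is rounded up to 8192, which is the 8K minimum mentioned above. A failed 16384-byte attempt would be retried at 8192.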

init_functions sets IO_CACHE::read_function and IO_CACHE::write_function according to the IO_CACHE type. When the buffer cannot satisfy a file IO request, these two functions are called to exchange data with the file.

case SEQ_READ_APPEND:
    info->read_function = _my_b_seq_read;
    info->write_function = 0;			/* Force a core if used */
    break;
default:
    info->read_function = info->share ? _my_b_read_r : _my_b_read;
    info->write_function = _my_b_write;
  }

Writes to a SEQ_READ_APPEND cache call my_b_append directly.
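The dispatch pattern can be reproduced with a reduced structure. Everything in this sketch is a hypothetical stand-in (toy_type, toy_cache, the toy_* functions); it only mirrors the idea that the cache type selects the slow-path function pointers and that the append-style cache gets no write function.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-ins for cache_type and IO_CACHE. */
enum toy_type { TOY_READ_CACHE, TOY_WRITE_CACHE, TOY_SEQ_READ_APPEND };

struct toy_cache {
    enum toy_type type;
    int (*read_function)(struct toy_cache *, unsigned char *, size_t);
    int (*write_function)(struct toy_cache *, const unsigned char *, size_t);
};

/* Placeholder slow paths; each returns a tag so a caller can tell which ran. */
static int toy_plain_read(struct toy_cache *c, unsigned char *b, size_t n)
{ (void)c; (void)b; (void)n; return 1; }

static int toy_seq_read(struct toy_cache *c, unsigned char *b, size_t n)
{ (void)c; (void)b; (void)n; return 2; }

static int toy_plain_write(struct toy_cache *c, const unsigned char *b, size_t n)
{ (void)c; (void)b; (void)n; return 3; }

/* Choose the slow-path functions from the cache type. */
static void toy_init_functions(struct toy_cache *c)
{
    switch (c->type) {
    case TOY_SEQ_READ_APPEND:
        c->read_function = toy_seq_read;
        c->write_function = NULL;  /* appends go through a separate call */
        break;
    default:
        c->read_function = toy_plain_read;
        c->write_function = toy_plain_write;
        break;
    }
}
```

Leaving the write pointer unset for the append type means a caller that uses the generic write path by mistake fails immediately instead of silently bypassing the append logic.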

Interface

The main interfaces are in include/my_sys.h, mostly in the form of macro definitions. Here is a quick look at a few common ones:

#define my_b_read(info,Buffer,Count) \
  ((info)->read_pos + (Count) <= (info)->read_end ?\
   (memcpy(Buffer,(info)->read_pos,(size_t) (Count)), \
    ((info)->read_pos+=(Count)),0) :\
   (*(info)->read_function)((info),Buffer,Count))

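This reads Count bytes from the IO_CACHE info into Buffer. read_pos is the current read position within IO_CACHE::buffer, and read_end is the end of the valid data in the buffer. Note that the distance from IO_CACHE::buffer to read_end is not necessarily the full buffer length, because the amount of buffered data is adjusted during reads and writes to keep file IO 4K aligned. The logic is simple: if the buffer does not hold enough valid data, read_function is called to do file IO.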
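The fast path and the fallback can be written out as ordinary functions to make the control flow easier to follow. This is a sketch with hypothetical names (toy_rcache, toy_read, toy_slow_read); the slow path here only counts calls, whereas a real read_function would refill the buffer from the file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced read cache: only the fields the fast path needs. */
struct toy_rcache {
    unsigned char *read_pos;  /* next byte to hand out */
    unsigned char *read_end;  /* one past the last valid byte */
    int slow_calls;           /* how often the slow path ran */
};

/* Stand-in for read_function. */
static int toy_slow_read(struct toy_rcache *c, unsigned char *dst, size_t n)
{
    (void)dst; (void)n;
    c->slow_calls++;
    return 1;  /* report that the request could not be satisfied */
}

/* Same shape as my_b_read: memcpy when the data is cached, otherwise fall
   back. The remaining length is compared instead of read_pos + n to avoid
   forming an out-of-range pointer. Returns 0 on success. */
static int toy_read(struct toy_rcache *c, unsigned char *dst, size_t n)
{
    if ((size_t)(c->read_end - c->read_pos) >= n) {
        memcpy(dst, c->read_pos, n);
        c->read_pos += n;
        return 0;
    }
    return toy_slow_read(c, dst, n);
}
```

As long as requests fit in the cached data, no function pointer is called at all, which is what keeps small sequential reads cheap.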

#define my_b_write(info,Buffer,Count) \
 ((info)->write_pos + (Count) <=(info)->write_end ?\
  (memcpy((info)->write_pos, (Buffer), (size_t)(Count)),\
   ((info)->write_pos+=(Count)),0) : \
   (*(info)->write_function)((info),(uchar *)(Buffer),(Count)))

This writes Count bytes from Buffer into the IO_CACHE info. The logic is similar: if the write buffer does not have enough room, a file IO is performed.
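To see when the write-side fallback fires across several calls, here is a small simulation. All names are hypothetical (toy_wcache, toy_write, toy_slow_write), the buffer size is assumed to be 8KB, and the slow path is deliberately simplified: it "flushes" by counting bytes, ignores alignment, and rejects requests larger than the buffer.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 8192u  /* assumed buffer size (2 * 4KB) */

/* Hypothetical, reduced write cache. */
struct toy_wcache {
    unsigned char buf[TOY_BUF_SIZE];
    unsigned char *write_pos;  /* next free byte in buf */
    unsigned char *write_end;  /* one past the end of buf */
    size_t flushed_bytes;      /* bytes handed to the simulated file so far */
    int flushes;               /* number of simulated file writes */
};

static void toy_winit(struct toy_wcache *c)
{
    c->write_pos = c->buf;
    c->write_end = c->buf + TOY_BUF_SIZE;
    c->flushed_bytes = 0;
    c->flushes = 0;
}

/* Stand-in for write_function: account for the buffered bytes as flushed,
   reset the buffer, then buffer the new data. */
static int toy_slow_write(struct toy_wcache *c, const unsigned char *src, size_t n)
{
    if (n > TOY_BUF_SIZE)
        return 1;  /* simplification: oversized requests are not handled */
    c->flushed_bytes += (size_t)(c->write_pos - c->buf);
    c->flushes++;
    c->write_pos = c->buf;
    memcpy(c->write_pos, src, n);
    c->write_pos += n;
    return 0;
}

/* Same shape as my_b_write: copy into the buffer if there is room,
   otherwise take the slow path. Returns 0 on success. */
static int toy_write(struct toy_wcache *c, const unsigned char *src, size_t n)
{
    if ((size_t)(c->write_end - c->write_pos) >= n) {
        memcpy(c->write_pos, src, n);
        c->write_pos += n;
        return 0;
    }
    return toy_slow_write(c, src, n);
}
```

Writing three 3000-byte chunks into this cache triggers the slow path only on the third call, because the first two fit in the 8KB buffer.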

#define my_b_tell(info) ((info)->pos_in_file + \
			 (size_t) (*(info)->current_pos - (info)->request_pos))

Here request_pos points to IO_CACHE::buffer, while current_pos is initialized in setup_io_cache to point at read_pos or write_pos. This design gives the different cache types a uniform interface.
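The pointer-to-pointer indirection is easier to see in a reduced form. The sketch below uses hypothetical names (toy_tcache, toy_setup, toy_tell) and an is_write_cache flag in place of the real type check; only the arithmetic is taken from the my_b_tell macro.

```c
#include <stddef.h>

/* Hypothetical, reduced cache with just the fields my_b_tell touches. */
struct toy_tcache {
    unsigned long long pos_in_file;  /* file offset of the first buffered byte */
    unsigned char *request_pos;      /* start of the buffer */
    unsigned char *read_pos;
    unsigned char *write_pos;
    unsigned char **current_pos;     /* points at read_pos or write_pos */
};

/* Select which position pointer counts as "current" for this cache. */
static void toy_setup(struct toy_tcache *c, int is_write_cache)
{
    c->current_pos = is_write_cache ? &c->write_pos : &c->read_pos;
}

/* Same arithmetic as my_b_tell: base offset plus distance into the buffer. */
static unsigned long long toy_tell(const struct toy_tcache *c)
{
    return c->pos_in_file + (size_t)(*c->current_pos - c->request_pos);
}
```

Because current_pos stores the address of the position field rather than a copy of its value, toy_tell follows later updates to read_pos or write_pos without any extra bookkeeping.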

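Some non-macro interfaces, such as my_b_seek, live in mysys/mf_iocache2.c and are not covered one by one here. In short, most common file system operations have a counterpart in IO_CACHE.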

_my_b_seq_read

Taking the SEQ_READ_APPEND type as the example, the function that performs file IO is _my_b_seq_read. The whole flow has three stages:

read from info->buffer
read from file descriptor
try append buffer

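Because a SEQ_READ_APPEND read may reach data in info->write_buffer that has not yet been written to the file system, the third stage reads from the write buffer. The essence of the code lies in computing how much data to read so that reads stay aligned. Look at the code: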

// First copy whatever is left in the IO_CACHE read buffer into Buffer
if ((left_length= (size_t) (info->read_end - info->read_pos)))
{
    memcpy(Buffer, info->read_pos, left_length);
    Buffer+= left_length;
    Count-= left_length;
}
// Update pos_in_file; if it is now past end_of_file, read from the append buffer
if ((pos_in_file= info->pos_in_file +
     (size_t) (info->read_end - info->buffer)) > info->end_of_file)
    goto read_append_buffer;

// diff_length is the offset of pos_in_file past a 4K boundary, used to align the read
diff_length= (size_t) (pos_in_file & (IO_SIZE-1));

// Second stage: read from the file.
// The IO_CACHE buffer is normally initialized to at least 2*IO_SIZE (8KB); this
// condition means Count no longer fits in one IO_CACHE buffer
if (Count >= (size_t) (IO_SIZE + (IO_SIZE - diff_length)))
{
    // Count exceeds the IO_CACHE buffer size, so read directly into Buffer.
    // How much should be read?
    // Take the high-order part of Count, a whole number of IO_SIZE blocks:
    // (Count & (size_t) ~(IO_SIZE-1)).
    // pos_in_file is still offset from a 4K-aligned address, so subtract that
    // offset to keep the whole read aligned
    length= (Count & (size_t) ~(IO_SIZE-1)) - diff_length;
    if ((read_length= mysql_file_read(info->file, Buffer, length, ...))) {}
    // update after read
    Count-= read_length;
    Buffer+= read_length;
    pos_in_file+= read_length;
    if (read_length != length)
        goto read_append_buffer; // did not get the requested length
    left_length+= length;
    diff_length= 0;  // no diff length now
}

// How much more data can be read into the IO_CACHE buffer
max_length= info->read_length - diff_length;
// This may run past the end of the file; the rest must come from the append buffer
if (max_length > (info->end_of_file - pos_in_file))
    max_length= (size_t) (info->end_of_file - pos_in_file);
if (!max_length) // already at the end of the file
{
    if (Count) // still something to read
        goto read_append_buffer; // go read from the append buffer
}
else // something can still be read from the file
{
    // Read into info->buffer; max_length reaches either the real end of the
    // file or the end of the read buffer
    length= mysql_file_read(info->file, info->buffer, max_length);
    if (length < Count) // still something to read
    {
        goto read_append_buffer;
    }
}

return 0;

read_append_buffer:
{
    // first look at the append buffer
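The alignment arithmetic of the second stage can be isolated in a standalone sketch. direct_read_length is a hypothetical helper, not a MySQL function, and IO_SIZE is assumed to be 4096; it only reproduces the diff_length and length computations from the excerpt above so they can be checked with concrete numbers.

```c
#include <stddef.h>

#define IO_SIZE 4096u  /* 4KB, the value the article quotes for IO_SIZE */

/* Number of bytes to read straight into the caller's buffer so that the read
   ends on an IO_SIZE boundary, or 0 if the request is too small to bypass
   the cache buffer. */
static size_t direct_read_length(unsigned long long pos_in_file, size_t count)
{
    /* How far pos_in_file sits past the previous IO_SIZE boundary. */
    size_t diff_length = (size_t)(pos_in_file & (IO_SIZE - 1));

    if (count < (size_t)(IO_SIZE + (IO_SIZE - diff_length)))
        return 0;

    /* Whole blocks contained in count, minus the initial misalignment. */
    return (count & ~(size_t)(IO_SIZE - 1)) - diff_length;
}
```

For a read of 10000 bytes starting at offset 5000, diff_length is 904 and the direct read is 7288 bytes, so the read ends at offset 12288, a multiple of 4096. The remaining bytes are then served through the cache buffer.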
