PostgreSQL Huge Page Recommendations - Notes for Large-Memory Hosts and Instances

Tags

PostgreSQL, Linux, huge page, shared buffer, page table, virtual address, physical address, memory address translation table

Background

When a host has a lot of memory, two things may need tuning: the scheduling of dirty-page writeback, and the table that maps virtual memory to physical memory.

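1 Dirty-page writeback tuning

1. Background writeback: adjust the threshold at which the background process starts flushing dirty pages, its wake-up interval, and the expiry age (how much dirty data triggers flushing, how often the amount of dirty data is checked, and how old a dirty page must be before it is flushed).

    vm.dirty_background_bytes = 4096000000
    vm.dirty_background_ratio = 0
    vm.dirty_expire_centisecs = 6000
    vm.dirty_writeback_centisecs = 100

2. User-process writeback: once dirty pages exceed this threshold, a user process that requests memory has to help flush dirty pages itself.

    vm.dirty_bytes = 0
    vm.dirty_ratio = 80

See also: "Operating System Kernel Parameters Every DBA Should Know" (《DBA不可不知的作業系統核心參數》)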
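As a quick check on the units of the settings above, the following sketch converts them into seconds and GiB. It is plain arithmetic on the values shown; the helper names are made up for illustration.

```python
# Illustrative helpers (hypothetical names) for reading the vm.dirty_* values above.

def centisecs_to_seconds(centisecs: int) -> float:
    """vm.dirty_*_centisecs values are expressed in hundredths of a second."""
    return centisecs / 100


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


settings = {
    "vm.dirty_background_bytes": 4096000000,
    "vm.dirty_expire_centisecs": 6000,
    "vm.dirty_writeback_centisecs": 100,
}

background_gib = bytes_to_gib(settings["vm.dirty_background_bytes"])
expire_s = centisecs_to_seconds(settings["vm.dirty_expire_centisecs"])
wakeup_s = centisecs_to_seconds(settings["vm.dirty_writeback_centisecs"])

print(f"background writeback starts at about {background_gib:.2f} GiB of dirty data")
print(f"dirty pages older than {expire_s:.0f} s are eligible for writeback")
print(f"the flusher wakes up every {wakeup_s:.0f} s")
```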

2 Page table mapping tuning

With virtual memory management, Linux has to maintain the mapping between virtual memory addresses and physical memory. To keep address translation fast, this mapping should ideally stay in the CPU's cache. The larger the page, the smaller the mapping table, so using huge pages reduces the size of the page tables.

The default page size can be obtained like this:

    # getconf PAGESIZE
    4096

https://en.wikipedia.org/wiki/Page_table
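To make "the larger the page, the smaller the mapping table" concrete, here is a rough estimate of the lowest-level mapping entries needed for a 128 GB region. It is a simplified model, not a measurement: it assumes 8-byte entries (as on x86_64), ignores the upper page-table levels, and assumes each process keeps its own entries for the shared region. The function name is hypothetical.

```python
# Rough estimate only: 8-byte entries are an x86_64 assumption, upper levels are ignored.
ENTRY_BYTES = 8


def page_table_bytes(mapped_bytes: int, page_bytes: int, processes: int = 1) -> int:
    """Bytes of lowest-level mapping entries needed to map `mapped_bytes`
    in each of `processes` processes."""
    entries = -(-mapped_bytes // page_bytes)  # ceiling division
    return entries * ENTRY_BYTES * processes


GIB = 2**30
shared = 128 * GIB

small = page_table_bytes(shared, 4096)        # 4 KB pages
huge = page_table_bytes(shared, 2 * 2**20)    # 2 MB pages

print(small // 2**20, "MiB per process with 4 KB pages")
print(huge // 2**10, "KiB per process with 2 MB pages")
```

Under this model, 1,900 processes that each touch the whole 128 GB region with 4 KB pages would need on the order of 475 GiB of page tables in the worst case, which is why the effect grows with the number of connections.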

Another reason to use huge pages: they stay resident in memory and are never swapped out, which applications that depend heavily on memory (databases included) like very much.

In a virtual memory system, the tables store the mappings between virtual addresses and physical addresses. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), which is the cache of page tables on a CPU that speeds up the translation of virtual addresses to physical addresses. Enabling the HugePages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. The hugetlb entries mean that the TLB entries can cover a larger address space, requiring many fewer entries to map the SGA, and releasing entries that can map other portions of the address space.

With HugePages enabled, the system uses fewer page tables, reducing the overhead for maintaining and accessing them. Huge pages remain pinned in memory and are not replaced, so the kernel swap daemon has no work to do in managing them, and the kernel does not need to perform page table lookups for them. The smaller number of pages reduces the overhead involved in performing memory operations, and also reduces the likelihood of a bottleneck when accessing page tables.
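The point about TLB entries covering a larger address space can be illustrated with a small calculation. The entry count below is a hypothetical figure chosen for illustration; real CPUs have several TLB levels and often separate entries per page size.

```python
def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Address space covered when every TLB entry maps exactly one page."""
    return entries * page_bytes


ENTRIES = 1024  # hypothetical TLB size, for illustration only

reach_4k = tlb_reach_bytes(ENTRIES, 4 * 2**10)
reach_2m = tlb_reach_bytes(ENTRIES, 2 * 2**20)

print(reach_4k // 2**20, "MiB covered with 4 KB pages")
print(reach_2m // 2**30, "GiB covered with 2 MB pages")
```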

PostgreSQL HugePage recommendations

1. Check the Linux huge page size

    # grep Hugepage /proc/meminfo
    Hugepagesize:       2048 kB

2. Decide how large shared_buffers will be. Suppose the host has 512 GB of memory and we want 128 GB of shared buffers.

    vi postgresql.conf
    shared_buffers='128GB'

3. Calculate how many huge pages are needed

    128GB/2MB=65536

4. Set the number of Linux huge pages, leaving some headroom because PostgreSQL's shared memory segment is larger than shared_buffers alone

    sysctl -w vm.nr_hugepages=67537
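The page-count calculation can be sketched as follows. Exact division of 128 GB by 2 MB gives 65536 pages; the 3% headroom is an assumption picked only because it lands near the 67537 used above, and the function name is hypothetical. The real requirement depends on everything PostgreSQL places in shared memory, not just shared_buffers.

```python
def huge_pages_needed(shared_bytes: int, huge_page_bytes: int, headroom_pct: int = 0) -> int:
    """Huge pages needed to back `shared_bytes`, plus an optional percentage of headroom."""
    pages = -(-shared_bytes // huge_page_bytes)   # ceiling division
    extra = -(-pages * headroom_pct // 100)       # headroom, rounded up
    return pages + extra


GIB, MIB = 2**30, 2**20

print(huge_pages_needed(128 * GIB, 2 * MIB))                   # shared_buffers only
print(huge_pages_needed(128 * GIB, 2 * MIB, headroom_pct=3))   # with assumed 3% headroom
```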

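5. Enable huge pages in PostgreSQL.

    vi $PGDATA/postgresql.conf
    huge_pages = on                 # on, off, or try
    # With try, PostgreSQL first attempts to use huge pages; if it cannot obtain the required number at startup, it starts without huge pages

6. Start the database

    pg_ctl start

7. Check how many huge pages are currently in use

    cat /proc/meminfo |grep -i huge
    AnonHugePages:      6144 kB
    HugePages_Total:   67537  ## the configured number of huge pages
    HugePages_Free:    66117  ## currently free, but not all of these are really available, because PG has reserved 65708 of them
    HugePages_Rsvd:    65708  ## huge pages requested (reserved) when PG started
    HugePages_Surp:        0
    Hugepagesize:       2048 kB   ## the huge page size is currently 2 MB

8. Run some queries and Free shrinks, because PG has now used those pages.

    cat /proc/meminfo |grep -i huge
    AnonHugePages:      6144 kB
    HugePages_Total:   67537
    HugePages_Free:    57482
    HugePages_Rsvd:    57073
    HugePages_Surp:        0
    Hugepagesize:       2048 kB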
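The relationship between Total, Free and Rsvd can be worked out from the output above. This sketch parses the first snapshot as a string so that it runs anywhere; the function names and the derived labels are illustrative, not kernel terminology.

```python
SAMPLE_AFTER_START = """\
AnonHugePages:      6144 kB
HugePages_Total:   67537
HugePages_Free:    66117
HugePages_Rsvd:    65708
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its integer value (unit suffix dropped)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values


def huge_page_summary(info: dict) -> dict:
    """Derive how the huge page pool is split, from Total, Free and Rsvd."""
    total = info["HugePages_Total"]
    free = info["HugePages_Free"]
    rsvd = info["HugePages_Rsvd"]
    return {
        "faulted_in": total - free,        # pages already touched and backed
        "reserved_untouched": rsvd,        # promised to a process, not yet touched
        "really_available": free - rsvd,   # what a new request could still obtain
    }


summary = huge_page_summary(parse_meminfo(SAMPLE_AFTER_START))
print(summary)
```

In both snapshots above, Free minus Rsvd is 409 pages and faulted-in plus reserved is 67128 pages: running queries moves pages from "reserved" to "faulted in" without changing what PostgreSQL holds overall. On a live system the same functions can be fed `open("/proc/meminfo").read()`.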

Oracle HugePage recommendations

Oracle is also a memory-heavy application. When the SGA is configured to be large, enabling huge pages is likewise recommended.

Oracle recommends using huge pages when the SGA is 8 GB or larger.

10.1 About HugePages

The HugePages feature enables the Linux kernel to manage large pages of memory in addition to the standard 4KB (on x86 and x86_64) or 16KB (on IA64) page size. If you have a system with more than 16GB of memory running Oracle databases with a total System Global Area (SGA) larger than 8GB, you should enable the HugePages feature to improve database performance.

Note

The Automatic Memory Management (AMM) and HugePages features are not compatible in Oracle Database 11g and later. You must disable AMM to be able to use HugePages.

The memory allocated to huge pages is pinned to primary storage, and is never paged nor swapped to secondary storage. You reserve memory for huge pages during system startup, and this memory remains allocated until you change the configuration.

Huge pages are 4MB in size on x86, 2MB on x86_64, and 256MB on IA64.

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html

Test: comparing with and without HugePage

Designing the test case

Create 10,240 tables and use merge insert (INSERT ... ON CONFLICT) to write up to 20 billion rows.

1. Create the tables

    do language plpgsql $$
    declare
    begin
      execute 'drop table if exists test';
      execute 'create table test(id int8 primary key, info text, crt_time timestamp)';
      for i in 0..10239 loop
        execute format('drop table if exists test%s', i);
        execute format('create table test%s (like test including all)', i);
      end loop;
    end;
    $$;

2. Create the dynamic write function. The first version does not use bind variables.

    create or replace function dyn_pre(int8) returns void as $$
    declare
      suffix int8 := mod($1,10240);
    begin
      execute format('insert into test%s values(%s, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, $1);
    end;
    $$ language plpgsql strict;

3. A version that uses bind variables (prepared statements), which performs better.

    create or replace function dyn_pre(int8) returns void as $$
    declare
      suffix int8 := mod($1,10240);
    begin
      execute format('execute p%s(%s)', suffix, $1);
      -- if the prepared statement does not exist yet in this session, create it and retry
      exception when others then
        execute format('prepare p%s(int8) as insert into test%s values($1, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix);
        execute format('execute p%s(%s)', suffix, $1);
    end;
    $$ language plpgsql strict;

4. Create the benchmark script

    vi test.sql
    \set id random(1,20000000000)
    select dyn_pre(:id);
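The routing done by `mod($1,10240)` in `dyn_pre` can be mirrored in a few lines, which also shows how many ids map to each table. This is only an illustration of the arithmetic; `target_table` is a hypothetical name and is not part of the test case.

```python
TABLE_COUNT = 10240
ID_MAX = 20_000_000_000  # upper bound of random(1,20000000000) in test.sql


def target_table(row_id: int) -> str:
    """Mirror of mod($1, 10240) in dyn_pre: pick the table for a given id."""
    return f"test{row_id % TABLE_COUNT}"


print(target_table(1))
print(target_table(10240))
print(ID_MAX // TABLE_COUNT, "ids map to each table")
```

Because the ids are spread evenly over all 10,240 tables, a long-lived connection ends up touching every table, which is what makes the per-session relcache and the page tables grow in this test.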

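5. Write performance benchmark

    pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 1200000

6. Benchmark with many long-lived connections while observing PAGE TABLE (two pgbench instances of 950 connections each, 1,900 in total)

    pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000
    pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000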

1 With HugePage

1. Write performance with a small number of connections

    transaction type: ./test.sql
    scaling factor: 1
    query mode: prepared
    number of clients: 56
    number of threads: 56
    duration: 120 s
    number of transactions actually processed: 17122345
    latency average = 0.392 ms
    latency stddev = 0.251 ms
    tps = 142657.055512 (including connections establishing)
    tps = 142687.784245 (excluding connections establishing)
    script statistics:
     - statement latencies in milliseconds:
             0.002  \set id random(1,20000000000)
             0.390  select dyn_pre(:id);

2. PAGE TABLE size with 1,900 long-lived connections. (Because the page table maps virtual memory to physical memory, its cost depends on the number of connections, on how much of the SHARED BUFFER each connection has touched, and on each session's own relcache and SYSCACHE.)

    cat /proc/meminfo |grep -i table
    Unevictable:           0 kB
    PageTables:       578612 kB  ## shared buffers use huge pages, which saves a great deal here.
    NFS_Unstable:          0 kB

2 Without HugePage

    sysctl -w vm.nr_hugepages=0

1. Write performance with a small number of connections

    transaction type: ./test.sql
    scaling factor: 1
    query mode: prepared
    number of clients: 56
    number of threads: 56
    duration: 120 s
    number of transactions actually processed: 18484181
    latency average = 0.364 ms
    latency stddev = 0.212 ms
    tps = 153887.936028 (including connections establishing)
    tps = 153905.968799 (excluding connections establishing)
    script statistics:
     - statement latencies in milliseconds:
             0.002  \set id random(1,20000000000)
             0.362  select dyn_pre(:id);

With a small number of connections, performance without huge pages was better than with huge pages. My guess is that huge pages involve something like a two-level translation, which adds some overhead: a mapping targets a single 2 MB page and cannot pinpoint the physical location of a default 8 KB data block directly. This may be analogous to a bitmap scan on a database index, which tells you which page the data is in rather than which record within the page.
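To put a number on the difference between the two 56-connection runs, this sketch computes the relative change from the pgbench figures reported in this section. The helper name is illustrative.

```python
def percent_change(baseline: float, other: float) -> float:
    """Relative difference of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100


# Figures copied from the two pgbench reports (excluding connection establishment).
tps_with_huge = 142687.784245
tps_without_huge = 153905.968799
latency_with_huge_ms = 0.392
latency_without_huge_ms = 0.364

tps_gain = percent_change(tps_with_huge, tps_without_huge)
latency_change = percent_change(latency_with_huge_ms, latency_without_huge_ms)

print(f"TPS without huge pages: {tps_gain:+.1f}%")
print(f"average latency without huge pages: {latency_change:+.1f}%")
```

So in this test the run without huge pages was roughly 8% faster at 56 connections. This is a single pair of runs, so the gap should be read as indicative rather than precise.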

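2. PAGE TABLE size with many long-lived connections

    cat /proc/meminfo |grep -i table
    Unevictable:           0 kB
    PageTables:     10956556 kB  ## grew to 10 GB before long, because every connection touches data in shared buffers, which can make the mapping table very large. The more connections there are and the more shared buffer data they touch, the more pronounced this is
    # PageTables is still growing
    NFS_Unstable:          0 kB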
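The two PageTables readings from the many-connection tests can be compared directly. The sketch below extracts the value from /proc/meminfo-style text and computes the ratio; the function name is hypothetical, and the without-huge-pages value was still growing when it was captured.

```python
def page_tables_kb(meminfo_text: str) -> int:
    """Return the PageTables value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("PageTables:"):
            return int(line.split()[1])
    raise ValueError("PageTables line not found")


with_huge = page_tables_kb("PageTables:       578612 kB")
without_huge = page_tables_kb("PageTables:     10956556 kB")

print(f"{with_huge / 2**20:.2f} GiB with huge pages")
print(f"{without_huge / 2**20:.2f} GiB without huge pages")
print(f"{without_huge / with_huge:.1f}x larger without huge pages")
```

On a live system the same function can be called as `page_tables_kb(open("/proc/meminfo").read())`.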

CentOS 7u huge page configuration example

1. Edit /boot/grub2/grub.cfg

Locate the first menuentry 'CentOS Linux' entry and append the following to the end of its linux16 /vmlinuz line:

    numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536

Notes: these parameters disable transparent huge pages and use the default 2 MB huge page size. You can also choose 1 GB huge pages, but first check which huge page sizes the system supports by looking at the flags in /proc/cpuinfo:

Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag).

The parameters also create 1536 huge pages at boot. This memory is reserved, so be careful to choose a suitable size; setting it through sysctl after Linux has booted is the recommended approach.

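transparent_hugepage=never disables transparent huge pages to avoid unnecessary trouble; the feature is probably not mature enough yet.

hugepagesz is the huge page size: either 2M or 1G, defaulting to 2M.

hugepages is the number of huge pages.

The total huge page memory is hugepagesz * hugepages, which is 3 GB here.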
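The hugepagesz * hugepages product can be computed straight from the boot parameters. This is a simplified model written for illustration: it pairs each hugepages= with the most recent hugepagesz= and ignores the kernel's handling of a hugepages= that appears without one. The function names are hypothetical.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text: str) -> int:
    """Turn a size such as '2M' or '1G' into bytes."""
    return int(text[:-1]) * UNITS[text[-1].upper()]


def reserved_bytes(cmdline: str) -> int:
    """Sum hugepagesz * hugepages over a kernel command line (simplified model)."""
    size = None
    total = 0
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "hugepagesz":
            size = parse_size(value)
        elif key == "hugepages" and size is not None:
            total += size * int(value)
    return total


params = "numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536"
print(reserved_bytes(params) // 2**30, "GiB reserved at boot")
```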

Example:

    menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-d8179b22-8b44-4552-bf2a-04bae2a5f5dd' {
            load_video
            set gfxpayload=keep
            insmod gzio
            insmod part_msdos
            insmod xfs
            set root='hd0,msdos1'
            if [ x$feature_platform_search_hint = xy ]; then
              search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  34f87a8d-8b73-4f80-b0ff-8d49b17975ca
            else
              search --no-floppy --fs-uuid --set=root 34f87a8d-8b73-4f80-b0ff-8d49b17975ca
            fi
            linux16 /vmlinuz-3.10.0-693.5.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/root rhgb quiet LANG=en_US.UTF-8 numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536
            initrd16 /initramfs-3.10.0-693.5.2.el7.x86_64.img
    }

Reboot the system. If you want to use huge pages without rebooting, run the following instead:

    sysctl -w vm.nr_hugepages=1536

However, changing the default huge page size (2M or 1G) always requires a reboot. For example:

    numa=off transparent_hugepage=never default_hugepagesz=1G hugepagesz=2M hugepagesz=1G

After rebooting, it looks like this:

    cat /proc/meminfo |grep -i huge
    AnonHugePages:         0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:    1048576 kB

Request 132 GB of huge pages:

    sysctl -w vm.nr_hugepages=132
    vm.nr_hugepages = 132

After rebooting, you can check the configuration with

    grep Huge /proc/meminfo

Output like the following means the setting has taken effect:

    HugePages_Total:    1536
    HugePages_Free:     1499
    HugePages_Rsvd:     1024
    HugePages_Surp:        0
    Hugepagesize:       2048 kB

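Database configuration: if you have decided that huge pages are mandatory, set huge_pages to on; otherwise set it to try. With on, startup fails with an error if there are not enough huge pages. With try, PostgreSQL falls back to starting with regular pages when there are not enough huge pages.

    postgresql.conf
    huge_pages = on
    shared_buffers = 2GB  # use 2 GB of memory; this value must be smaller than the total huge page memory

Note

If postgresql.conf sets huge_pages=on and shared_buffers equals the total huge page memory (hugepagesz*hugepages), the database cannot start and reports the following error:

This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space.

Solution: shared_buffers must be smaller than the total huge page memory.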
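The reason shared_buffers cannot equal the whole pool is that PostgreSQL's shared memory segment holds more than the buffers themselves. A minimal check under that assumption is sketched below; the 100 MB allowance for the non-buffer part is a made-up figure for illustration, and the function name is hypothetical.

```python
def fits_in_huge_pages(shared_buffers_bytes: int, extra_shared_bytes: int,
                       nr_hugepages: int, huge_page_bytes: int) -> bool:
    """True if shared_buffers plus other shared memory fits in the huge page pool."""
    needed = shared_buffers_bytes + extra_shared_bytes
    needed_pages = -(-needed // huge_page_bytes)  # ceiling division
    return needed_pages <= nr_hugepages


MIB, GIB = 2**20, 2**30
EXTRA = 100 * MIB  # hypothetical allowance for non-buffer shared memory

ok_2g = fits_in_huge_pages(2 * GIB, EXTRA, 1536, 2 * MIB)   # 2 GB of buffers, 3 GB pool
ok_3g = fits_in_huge_pages(3 * GIB, EXTRA, 1536, 2 * MIB)   # buffers equal to the whole pool

print("shared_buffers = 2GB fits:", ok_2g)
print("shared_buffers = 3GB fits:", ok_3g)
```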

libhugetlbfs

Installing libhugetlbfs lets you inspect huge page statistics and set up a huge page filesystem, for example when you want to test keeping data in memory.

https://lwn.net/Articles/376606/

yum install -y libhugetlbfs*

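Summary

1. Viewing and changing the huge page sizes that Linux currently supports.

https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize

2. With few connections, performance with HUGE PAGE is worse than without it (my guess is that huge pages involve something like a two-level translation, which adds some overhead). So use a connection pool where possible to reduce the number of connections and improve performance.

3. If you can use a connection pool, do so, and reduce the number of connections to the database. PG, like Oracle, uses a process model: more connections mean more processes, and on large-memory hosts this raises several issues:

3.1 The overhead of context switches and MEM COPY.

3.2 PAGE TABLE growth, which increases memory usage.

PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.

3.3 Each session caches syscache, relcache and similar information. If it accesses many objects, memory usage balloons. (This is memory amplification at the logical level. If few objects are accessed, or if few long-lived connections have accessed many objects, the problem is not noticeable.)

"The Memory-Hogging Pitfall of PostgreSQL relcache in Long-Lived-Connection Applications" (《PostgreSQL relcache在長連接配接應用中的記憶體霸占"坑"》)

This is easy to reproduce: take the benchmark case in this article and increase the number of tables and the number of columns per table, and each connection's relcache grows.

4. If you cannot use a connection pool, and there are very many connections that are all long-lived (accessing many objects and much of the data in shared buffers), then with very large shared buffers you should consider huge pages so that the page tables stay relatively small. If HugePage is not an option, a smaller shared_buffers is recommended.

5. Memory that a process uses privately does not amplify PAGE TABLE, because only that process uses it. So work_mem and maintenance_work_mem are unlikely to cause oversized page tables.

Watch PageTables in /proc/meminfo to judge whether you need to enable huge pages, lower shared_buffers, or reduce the number of connections.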

References

"PostgreSQL on Linux Best-Practice Deployment Manual" (《PostgreSQL on Linux 最佳部署手冊》)

"PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw foreign tables) on ECS Deployment Guide (for new users)" (《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(适合新使用者)》)

https://blog.dbi-services.com/configuring-huge-pages-for-your-postgresql-instance-redhatcentos-version/

https://momjian.us/main/writings/pgsql/hw_performance/

https://www.kernel.org/doc/gorman/html/understand/understand006.html

https://wiki.osdev.org/Page_Tables

https://github.com/digoal/blog/blob/master/201803/20180325_02_pdf_001.pdf

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Memory-Configuring-huge-pages
