PostgreSQL技術大講堂 - 第44講：pg流復制部署

PostgreSQL從小白到專傢，是從入門逐漸能力提升的一個系列教程，內容包括對PG基礎的認知、包括安裝使用、包括角色權限、包括維護管理、、等內容，希望對熱愛PG、學習PG的同學們有幫助，歡迎持續關註CUUG PG技術大講堂。

第44講：流復制部署

PostgreSQL第44講：2月3日(周六)19:30，釘釘群（35822460）& 視頻號（數據庫老陳）直播“老陳和德哥pg流復制部署現場等你來...”

內容1：流復制部署

流復制部署手冊

一主備機器規劃主機名

角色主機名 ip

Maswer： Pg1 192.168.18.211

Slave： Pg2 192.168.18.212

PG版本：

用源碼編譯安裝的12.2版本

二創建流復制

2.1 設置host

master,slave兩節點都要操作。

# vim /etc/hosts

#編輯內容如下：

192.168.18.211 pg1

192.168.18.212 pg2

2.2 在主庫設置

2.2.1先初始化新數據庫

$ initdb -D /usr/local/pg12.2/data -U postgres --data-checksums

2.2.2啟動數據庫並建立同步用戶

$pg_ctl -D /usr/local/ pg12.2/data start

創建用戶：create role 同步用的用戶名 login replication encrypted password '密碼';

postgres=# create role repl login replication encrypted password 'repl';

CREATE ROLE

postgres=#\q --退出psql

2.2.3配置$PGDATA/data/pg_hba.conf，添加下面內容：

格式：host replication 同步用的用戶名備庫IP地址或域名/24 trust

2.2.4配置主備庫的postgres.con文件

因為以後要進行角色切換，所以現在都添加一樣的參數。

host replication repl pg2 trust

host replication repl 192.168.18.0/24 trust

host all all 192.168.18.0/24 trust

主庫配置~/data/postgres.conf 查找並修改成以下內容

listen_addresses = '*'

wal_level = replica --10以後的版本為replica 物理復制，有一定局限性，可以選擇邏輯復制logical

max_wal_senders=10 --流復制允許連接進程，主備庫這個參數值必須一樣

wal_keep_segments =64

archive_mode = on -- 設置歸檔模式

archive_command = 'cp %p /home/postgres/arch/%f' --設置歸檔cp命令

listen_addresses = '*'

port = 1922

wal_level = logical

max_wal_senders=20

wal_keep_segments =64

archive_mode = on

archive_command = 'cp %p /home/postgres/arch/%f'

restore_command = 'cp /home/postgres/arch/%f %p'

recovery_target_timeline = 'latest'

log_destination = 'csvlog'

logging_collector = on

log_directory = 'pg_log' ##$PGDATA/pg_log

log_filename = 'postgresql-%Y-%m-%d'

log_truncate_on_rotation = off

log_rotation_age = 1d

log_rotation_size = 0

log_error_verbosity = verbose

log_statement = all

2.2.5重啟主庫服務，以更新配置

$pg_ctl -D ~/data/ -l ~/log/pglog.log restart

2.3 在備庫設置

2.3.1不需要初始化，直接從主庫備份就行，如有DATA直接刪掉或改名掉:

$ pg_basebackup -h pg1 -p 1922 -U repl -R -F p -P -D $PGDATA

備註：

-h，主庫主機，-p，主庫服務端口;

-U，復制用戶;

-F，p是默認輸出格式，輸出數據目錄和表空間相同的佈局，t表示tar格式輸出;

-P，同--progress，顯示進度;

-D，輸出到指定目錄;

-R 創建一個recovery.conf文件，10版本後就沒有該文件，改為standby.signal文件，需要自己創建，所以該參數可以省略

如果報錯：

pg_basebackup: error: could not connect to server: could not connect to server: No route to host Is the server running on host "pg1" (192.168.18.211) and accepting TCP/IP connections on port 1922?

解決方法：發現是系統防火墻的問題：

# 查看防火墻狀態

service iptables status

systemctl stop firewalld --centos 7

systemctl disable firewalld --centos 7

# 停止防火墻

service iptables stop

# 永久關閉防火墻

chkconfig iptables off

2.3.2備庫修改配置文件(由於是從主庫備份過來的，配置文件也修改了)

postgres@NanoPI-006:~$vi ~/data/postgresql.conf

添加以下內容

listen_addresses = '*'

wal_level = replica

max_wal_senders=20

wal_keep_segments =64

archive_mode = on

archive_command = 'cp %p /home/postgres/arch/%f'

restore_command = 'cp /home/postgres/arch/%f %p'

recovery_target_timeline = 'latest'

full_page_writes = on

wal_log_hints = on

hot_standby = on #在同步的同時允許查詢，默認值

max_standby_streaming_delay = 30s #可選，流復制最大延遲

wal_receiver_status_interval = 10s #可選，從向主報告狀態的最大間隔時間

hot_standby_feedback = on #可選，查詢沖突時向主反饋

2.3.3配置~/data/pg_hba.conf

添加下面內容

host replication repl 192.168.18.0/24 trust

host all all 192.168.18.0/24 trust

#在備庫中維護的主庫IP地址是為了以後切換使用

2.3.4、創建備庫文件standby.signal

primary_conninfo = 'host=pg1 port=1922 user=repl password=repl options=''-c wal_sender_timeout=5000'''

restore_command = 'cp /home/postgres/arch/%f %p'

archive_cleanup_command = 'pg_archivecleanup /home/postgres/arch %r'

standby_mode = on

第一行參數：#連接到主庫信息

第二行參數：將來變成主庫時需要用到的參數。

第三行參數：變成主庫後需要清空的歸檔日志。

第四行參數：把備庫變成read-only transaction模式，不允許進行寫操作。允許查詢。這一點非常好。

2.3.5啟動備庫數據服務

$pg_ctl -D $PGDATA -l ~/log/pglog.log start

增加日志輸出設置，從而獲得更多的信息。

三、驗證

3.1、觀察主從兩庫的歸檔日志的位置，或者主庫兩邊的pg_wal目錄下的內容，發現主庫日志切換後，備庫pg_wal目錄下就會產生新的日志文件，但是在備庫的歸檔目錄下沒有內容，應該是主庫的歸檔日志傳遞到備庫的pg_wal目錄下了。

3.2、主庫修改後，日志沒有歸檔，但是備庫已經同步了，類似於oracle同步時用lgwr方式進行寫standby_logfile進行同步。

3.3、查看當前備庫狀態：

testdb=# select pg_is_in_recovery();

pg_is_in_recovery

-------------------

t ：true，意味著處於recovery狀態

f ：false，意味著處於正常服務狀態

主庫查詢：

testdb=# \x

testdb=# select * from pg_stat_replication;

-[ RECORD 1 ]---- ------------------------------

pid | 3711

usesysid | 16384

usename | repl

application_name | walreceiver

client_addr | 192.168.18.212

client_hostname | pg2

client_port | 49206

backend_start | 2020-03-03 22:08:47.924435-05

backend_xmin |

state | streaming

sent_lsn | 0/210000D8

write_lsn | 0/210000D8

flush_lsn | 0/210000D8

replay_lsn | 0/210000D8

write_lag |

flush_lag |

replay_lag |

sync_priority | 0

sync_state | async

reply_time | 2020-03-03 22:13:02.990258-05

#application_name 很重要，以後同步復制需要用到。

3.4、備庫數據庫日志內容：

cp: cannot stat `/home/postgres/arch/00000002.history': No such file or directory

cp: cannot stat `/home/postgres/arch/000000010000000000000009': No such file or directory

2020-02-29 04:48:45.734 EST [4938] FATAL: could not connect to the primary server: could not connect to server: Connection refused

Is the server running on host "pg1" (192.168.18.211) and accepting

TCP/IP connections on port 1922?

cp: cannot stat `/home/postgres/arch/00000002.history': No such file or directory

cp: cannot stat `/home/postgres/arch/000000010000000000000009': No such file or directory

2020-02-29 04:48:50.747 EST [4941] LOG: started streaming WAL from primary at 0/9000000 on timeline 1

如果主庫關閉，備庫數據庫日志內容：

cp: cannot stat `/home/postgres/arch/00000002.history': No such file or directory

cp: cannot stat `/home/postgres/arch/00000001000000000000000C': No such file or directory

2020-02-29 05:22:55.757 EST [5048] FATAL: could not connect to the primary server: could not connect to server: Connection refused

Is the server running on host "pg1" (192.168.18.211) and accepting

TCP/IP connections on port 1922?

cp: cannot stat `/home/postgres/arch/00000002.history': No such file or directory

3.5、主庫後臺進程：

ps -ef|grep "wal"

postgres 3753 3749 0 21:21 ? 00:00:00 postgres: walwriter

postgres 3844 3749 0 21:49 ? 00:00:00 postgres: walsender repl 192.168.18.212(33595) streaming 0/8000148

3.6、備庫後臺進程，一個進程負責接收，一個負責recovery：

ps -ef|grep postgres

postgres 3472 3471 0 21:49 ? 00:00:00 postgres: startup recovering 000000010000000000000008

postgres 3475 3471 0 21:49 ? 00:00:00 postgres: checkpointer

postgres 3476 3471 0 21:49 ? 00:00:00 postgres: background writer

postgres 3478 3471 0 21:49 ? 00:00:00 postgres: stats collector

postgres 3479 3471 0 21:49 ? 00:00:00 postgres: walreceiver streaming 0/8000148

四、主從切換

4.1、停掉主庫

pg_ctl stop -m fast

4.2、執行以下命令進行主從切換，把備庫改成主庫，執行之後發現standby.signal被刪除了：

pg_ctl promote

查看最新狀態：

pg_controldata | grep cluster

Database cluster state: in production

4.3、這一步非常關鍵，註意原來的備庫的postgresql.auto.conf文件中會自動添加一行primary_conninfo的信息，要把這一行給註釋掉，否則雖然現在是主庫了，但是配置還是當作備庫，自相矛盾，且在跟蹤日志中會報“background worker "logical replication launcher" (PID 6304) exited with exit code 1”錯誤。這可能是PG12.2的bug。

postgresql.auto.conf文件內容如下，註意下面內容隻是一行數據，/home/postgres/.pgpass其實沒有沒有這個文件，不需要創建：

primary_conninfo = 'user=repl passfile=''/home/postgres/.pgpass'' host=pg1 port=1922 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any'

重啟數據庫，查看後臺進程，實驗發現walsender進程要等備庫正常啟動後才會啟動，備庫關閉時該進程也自動中斷：

ps -ef|grep postgres |grep -v sshd |grep -v bash

postgres 3215 3164 0 Feb29 pts/3 00:00:00 tail -f pg_log

postgres 6329 1 0 07:08 ? 00:00:00 /usr/local/pg12.2/bin/postgres

postgres 6331 6329 0 07:08 ? 00:00:00 postgres: checkpointer

postgres 6332 6329 0 07:08 ? 00:00:00 postgres: background writer

postgres 6333 6329 0 07:08 ? 00:00:00 postgres: walwriter

postgres 6334 6329 0 07:08 ? 00:00:00 postgres: autovacuum launcher

postgres 6335 6329 0 07:08 ? 00:00:00 postgres: archiver

postgres 6336 6329 0 07:08 ? 00:00:00 postgres: stats collector

postgres 6337 6329 0 07:08 ? 00:00:00 postgres: logical replication launcher

postgres 6353 6329 0 07:12 ? 00:00:00 postgres: walsender repl 192.168.18.212(33609) streaming 0/1A01C7F8

4.4、在新備庫上創建一個standby.signal文件，添加如下內容：

primary_conninfo = 'host=pg2 port=1922 user=repl password=repl options=''-c wal_sender_timeout=5000'''

restore_command = 'cp /home/postgres/arch/%f %p'

archive_cleanup_command = 'pg_archivecleanup /home/postgres/arch %r'

standby_mode = on

4.5、在新備庫的postgresql.auto.conf文件中添加如下內容，這一步非常關鍵，第一次搭建備庫的時候會自動添加，但是切換後卻不能：

primary_conninfo = 'user=repl passfile=''/home/postgres/.pgpass'' host=pg2 port=1922 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any'

註意/home/postgres/.pgpass其實沒有沒有這個文件，不需要創建。

4.6、啟動新備庫：

pg_ctl start -l pg_log

4.7、查看後臺進程：

ps -ef|grep postgre |grep -v ssh |grep -v bash

postgres 3274 3237 0 Feb29 pts/3 00:00:00 tail -f pg_log

postgres 6441 1 0 07:12 ? 00:00:00 /usr/local/pg12.2/bin/postgres

postgres 6442 6441 0 07:12 ? 00:00:00 postgres: startup recovering 00000003000000000000001A

postgres 6447 6441 0 07:12 ? 00:00:00 postgres: checkpointer

postgres 6448 6441 0 07:12 ? 00:00:00 postgres: background writer

postgres 6450 6441 0 07:12 ? 00:00:00 postgres: stats collector

postgres 6451 6441 0 07:12 ? 00:00:05 postgres: walreceiver streaming 0/1A01C7F8

4.8、驗證主備庫是否能夠同步

在主庫進行dml操作，發現備庫能夠正常同步，切換成功。

4.9、主庫變成備庫時，有時候啟動會失敗，日志顯示找不到xxx.history日志文件，解決辦法，在postgresql.conf中指定明確的timeline，把原來的latest替換掉：

recovery_target_timeline = '3'

有時需要把缺少的文件復制到備庫的歸檔目錄下，比如xxxx.history文件。

總結：

經過實驗，發現主備切換不太靈活和智能，需要後續進行手動修改，特別是postgresql.auto.conf文件中自動添加的一行，在主備切換的時候不會自動刪除，沒有相關文檔，造成了隱性的問題，給DBA造成了很大的麻煩，不容易故障排除。

主庫在正常運行中，備庫可以隨意切換為主庫，沒有一個制約機制，感覺不嚴謹，此時變成兩個主庫，數據無法同步。如果此時兩邊的數據庫都各自發生變化，將來想把一臺主庫當作備庫，則需要在備庫上對當前的數據進行同步，然後就可以變成備庫，用以下的命令進行同步：

pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.18.211 port=1922 user=postgres dbname=testdb'

pg_rewind: servers diverged at WAL location 0/1C01F280 on timeline 5

pg_rewind: rewinding from last common checkpoint at 0/1C01F1D0 on timeline 5

pg_rewind: Done!

五、pg_rewind工具

如果備庫是意外崩潰，如果新的主庫修改了數據，經過的時間很長，歸檔日志又刪除了，無法同步，原來的數據庫如果想變成備庫，需要對數據庫做一次同步，那麼就可以用到pg_rewind工具進行同步。

pg_rewind—使一個PostgreSQL數據目錄與另一個數據目錄(該目錄從第一個PostgreSQL數據目錄創建而來)一致。

描述

pg_rewind是一個在集群的時間線參數偏離之後，用於使一個PostgreSQL集群與另一個相同集群的拷貝同步的工具。一個典型的場景是在故障轉移之後，讓一個老的主服務器重新在線作為一個standby跟隨新主服務器。

其結果相當於使用源數據目錄替換目標數據目錄。所有的文件都被拷貝，包括配置文件。與做一個基礎備份或者像rsync這樣的工具相比，pg_rewind的優勢是pg_rewind不需要讀取所有集群中沒有更改的文件。當數據庫很大，並且隻有一小部分不同的集群之間，使它的速度快得多。

pg_rewind檢查源集群與目標集群的時間線歷史來檢測它們產生分歧的點，並希望在目標集群的pg_xlog目錄找到WAL回到分歧點的所有方式。在典型的故障轉移場景：目標集群在分歧之後立即被關閉，那是沒有問題的，但是，如果目標集群在分歧之後運行了很長一段時間，老的WAL文件可能不存在了。在這種情況下，它們可以手動從WAL歸檔復制到pg_xlog目錄。目前不支持從一個WAL歸檔中自動獲取丟失的文件。

在運行pg_rewind之後，當目標服務器第一次被啟動，它將進入恢復模式並重放從分歧點之後源服務器產生的所有WAL。當pg_rewind被運行時，如果一些 WAL在源服務器上不再可用，因此不能用pg_rewind回話復制，當目標服務器被啟動時時可以的。這可以通過在目標數據目錄創建一個帶有合適的restore_command命令的recovery.conf文件來實現。

選項

pg_rewind 接受下列命令行參數：

-D 目錄

--target-pgdata=目錄

該選項指定與源同步的目標數據目錄。

--source-pgdata=目錄

指定源服務器的數據目錄的路徑，以使目標數據目錄與之同步。當—source-pgdata被使用時，源服務器必須被關閉。

--source-server=連接字符串

指定一個libpq連接字符串以連接到源PostgreSQL服務器來使目標同步。服務器必須開啟並允許，並且不能處於恢復模式。

-n

--dry-run

做除了修改目標目錄的所有事情。

-P

--progress

開啟進程報告。在從源集群復制數據時，打開這個功能將提供一個近似的進度報告。

--debug

打印詳細的調試輸出對開發者調試pg_rewind來說是非常有用的。

-V

--version

顯示版本信息並退出。

--help

顯示幫助，然後退出

環境

當—source-server選項被使用時，pg_rewind也使用libpq支持的環境變量 (見31.14節)。

註意

pg_rewind需要啟用postgresql.conf中的wal_log_hints 選項，或者當集群被使用initdb初始化時啟用數據校驗。full_page_writes也必須啟用。

pg_rewind是如何工作的

基本的思想是從新的集群拷貝所有的東西到老的集群，除了我們知道的相同的(數據)塊。

1.從最後一個檢查點開始掃描老集群的WAL日志，在該檢查點之前，新集群的時間線歷史從老集群被創建出來。對於每一個WAL記錄，做一個數據塊被觸及的記錄。在新的集群被創建出來以後，這產生所有在老集群中被更改的數據塊的列表。

2.從新集群復制所有這些被更改的數據塊到老集群。

3.從新集群復制所有其它像clog，conf這樣的文件等等到老集群。每個文件，除了表文件。

4.從新集群應用WAL，從故障轉移創建的檢查點開始。(嚴格的說，pg_rewind不應用WAL，它隻是創建一個備份標簽文件以表明PostgreSQL被啟動了，它會從檢查點重放並應用所有需要的WAL)

2020-02-28 01:58:35.974 EST [16990] LOG: received promote request

2020-02-28 01:58:35.974 EST [16990] LOG: redo done at 0/50000028

2020-02-28 01:58:35.977 EST [16990] LOG: last completed transaction was at log time 2020-02-27 21:40:31.673922-05

cp: cannot stat `/home/postgres/arch/000000090000000000000050': No such file or directory

cp: cannot stat `/home/postgres/arch/0000000A.history': No such file or directory

2020-02-28 01:58:35.987 EST [16990] LOG: selected new timeline ID: 10

2020-02-28 01:58:36.090 EST [16990] LOG: archive recovery complete

cp: cannot stat `/home/postgres/arch/00000009.history': No such file or directory

2020-02-28 01:58:36.112 EST [16989] LOG: database system is ready to accept connections

五、實時同步

上面的配置是異步同步，對於主庫的性能影響是最小的，但是會丟數據，我們可以把復制配置成實時同步。

當設置同步復制時，盡量記住以下幾點：

最小化延遲

確保您有冗餘延遲

同步復制比異步復制代價更高

同步時是通過一個關鍵的參數application_name來實現的。

5.1、配置主庫postgres.conf，添加如下內容：

synchronous_standby_names = 'standby_pg2'

synchronous_commit = on --默認值，可以設置為remote_write，對主庫性能有利

5.2、重啟主庫

5.3、修改備庫standby.signal配置文件，在原來的內容中添加application_name內容：

primary_conninfo = 'host=pg1 application_name=standby_pg2 port=1922 user=repl password=oracle options=''-c wal_sender_timeout=5000'''

restore_command = 'cp /home/postgres/arch/%f %p'

archive_cleanup_command = 'pg_archivecleanup /home/postgres/arch %r'

standby_mode = on

5.4、修改備庫postgresql.auto.conf，添加application_name內容，實際上備庫是以這個文件為主，上面修改的standby.signal並不生效：

primary_conninfo = 'user=repl passfile=''/home/postgres/.pgpass'' host=pg1 application_name=standby_pg2 port=1922 sslmode=disable sslcompression

=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any'

5.5、重啟備庫，查看後臺日志信息：

consistent recovery state reached at 0/21000188

invalid record length at 0/210001C0: wanted 24, got 0

database system is ready to accept read only connections

started streaming WAL from primary at 0/21000000 on timeline 6

5.6、在主庫查看同步狀態：

postgres=# \x --以行形式顯示，類似於mysql的\G

postgres=# SELECT * FROM pg_stat_replication;

-[ RECORD 1 ]---- ------------------------------

pid | 3732

usesysid | 16384

usename | repl

application_name | standby_pg2

client_addr | 192.168.18.212

client_hostname | pg2

client_port | 49207

backend_start | 2020-03-03 22:14:24.010759-05

backend_xmin |

state | streaming

sent_lsn | 0/210001C0

write_lsn | 0/210001C0

flush_lsn | 0/210001C0

replay_lsn | 0/210001C0

write_lag |

flush_lag |

replay_lag |

sync_priority | 1

sync_state | sync

reply_time | 2020-03-03 22:14:44.126791-05

狀態顯示為實時同步。

5.7、驗證：

在同步過程中，如果把備庫給關閉，然後在主庫進行數據操作，會發現無法操作，該事務會掛起，處於等待狀態。此時對主庫會造成很大的影響，跟oracle的最大保護模式一樣。

5.8、如果我們配置了多個備庫，而且進行實時同步，假如隻要保證前面的備庫能夠實時就可以，那麼可以進行如下設置：

synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'

5.9、如果隻要保證其中任何的備庫同步成功，可以進行如下設置：

synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

六、添加節點

6.1、添加新的節點跟第二個節點添加方式一樣，修改standby.signal和postgres.auto.conf文件，然後啟動節點三。

6.2、修改主庫的postgres.conf，添加如下一行：

synchronous_standby_names = 'FIRST 2 (standby_pg2,standby_pg3)'

6.3、重啟主庫，查看復制狀態：

testdb=# \x

Expanded display is on.

testdb=# select * from pg_stat_replication;

-[ RECORD 1 ]---- ------------------------------

pid | 8604

usesysid | 16384

usename | repl

application_name | standby_pg3

client_addr | 192.168.18.213

client_hostname | pg3

client_port | 34436

backend_start | 2020-03-06 05:46:23.01908-05

backend_xmin |

state | streaming

sent_lsn | 0/2A00FA38

write_lsn | 0/2A00FA38

flush_lsn | 0/2A00FA38

replay_lsn | 0/2A00FA38

write_lag |

flush_lag |

replay_lag |

sync_priority | 2

sync_state | sync

reply_time | 2020-03-06 05:46:33.088474-05

-[ RECORD 2 ]---- ------------------------------

pid | 4716

usesysid | 16384

usename | repl

application_name | standby_pg2

client_addr | 192.168.18.212

client_hostname | pg2

client_port | 50026

backend_start | 2020-03-06 03:13:46.966522-05

backend_xmin |

state | streaming

sent_lsn | 0/2A00FA38

write_lsn | 0/2A00FA38

flush_lsn | 0/2A00FA38

replay_lsn | 0/2A00FA38

write_lag |

flush_lag |

replay_lag |

sync_priority | 1

sync_state | sync

reply_time | 2020-03-06 05:46:27.970934-05

6.4、驗證同步

主要備庫的任何一個節點無法同步，都會影響主庫的事務操作。但是發現正常的一個備庫節點能夠同步，即使主庫處於停留狀態，由此證明主庫已經把事務傳遞到備庫了，隻是有備庫沒有同步，所以處於等待狀態。

6.5、如果把主庫的參數修改如下：

synchronous_standby_names = 'FIRST 1 (standby_pg2,standby_pg3)'

6.6、實驗證明，如果第二個備庫節點發生故障無法同步，不會影響主庫事務操作。

七、其它配置

7.1、正常情況下備庫會盡快恢復來自於主服務器的 WAL 記錄。但是有時候備庫的復制延遲一段時間，它能提供機會糾正數據丟失錯誤。雖然這種需求比較少見，但是也有個別的需求，recovery_min_apply_delay參數允許你將復制延遲一段時間，默認時間單位則為毫秒。例如，如果你設置這個參數為10min，對於一個事務提交，隻有備庫的系統時間超過主庫的提交時間至少 5分鐘時，備庫才會應用該事務。

在備庫的postgresql.auto.conf添加如下參數，備庫延遲recovery：

recovery_min_apply_delay = 1min

重啟數據庫生效，就會發現備庫延遲一分鐘recovery，註意這個參數隻是控制備庫延遲應用日志，不影響主庫傳輸日志到備庫，即使主備庫配置成實時同步，不會影響主庫事務操作。

7.2、如果設置了synchronous_commit=remote_apply，然後再設置recovery_min_apply_delay = 1min，會發現生產庫的事務會發生等待，直到備庫過一分鐘recovery結束後才完成，所以要避免這種情況發生。

7.3、如果把如果pg數據庫的歸檔日志都存放在一個目錄下，那麼將來主從切換的時候會造成錯誤，導致啟動失敗。

八、提高主庫的可用性和故障處理

處於同步復制的備用服務器發生故障並且不再能夠返回ACK響應，主服務器仍將繼續永遠等待響應。因此，無法提交正在運行的事務，也無法啟動後續查詢處理。流式復制不支持通過超時自動還原到異步模式的功能。

兩種解決辦法：

使用多個備用服務器來提高系統可用性

通過手動執行從同步模式切換到異步模式

(1) 將參數synchronous_standby_names設置為空字符串。

(2) 使用reload選項執行pg_ctl命令。

postgres> pg_ctl -D $PGDATA reload

我們討論第一種解決辦法：使用多個備用服務器來提高系統可用性。

1、配置主庫postgres.conf文件：

synchronous_standby_names = 'standby_pg2,standby_pg3'

--此時pg2的優先級比pg3的要高

2、查看流復制狀態：

testdb=# SELECT application_name AS host, sync_priority, sync_state FROM pg_stat_replication;

host | sync_priority | sync_state

------------- --------------- ------------

standby_pg2 | 1 | sync

standby_pg3 | 2 | potential

2、把standby_pg2數據庫關閉，則standby_pg3就會變成sync，而生產庫進行dml操作不受到影響，因為此時standby_pg3替代了standby_pg2，成為第一備庫。

testdb=# SELECT application_name AS host, sync_priority, sync_state FROM pg_stat_replication;

host | sync_priority | sync_state

------------- --------------- ------------

standby_pg3 | 2 | sync

3、如果此時把standby_pg3也關閉，則主庫的ddl和dml操作就會處於等待狀態，因為當前沒有可用的備庫來進行實時同步。

4、接下來隻要啟動任一的備庫，就會立刻成為第一備庫，則生產庫就能夠繼續進行數據操作。

註意：

根據故障類型的不同，通常可以在故障發生後立即檢測到故障，而有時在故障發生和檢測到故障之間可能有一個時間間隔。特別是，如果同步備用服務器中發生這一種類型的故障(硬件和網絡的故障檢測)，則主服務器上的所有事務處理都將停止，直到檢測到備用服務器的故障為止，即使多個潛在的備用服務器可能已在工作。

把pg數據庫的日志功能打開，可以查看更多的信息：

postgres.conf添加參數如下：

log_destination = 'csvlog'

logging_collector = on

log_directory = 'pg_log'

log_filename = 'postgresql-%Y-%m-%d'

log_truncate_on_rotation = off

log_rotation_age = 1d

log_rotation_size = 0

log_error_verbosity = verbose

log_statement = all

經過測試，發現把日志目錄存放在$PGDATA/pg_log下，能夠記錄的內容很多，經過觀察發現pg的很多自身的命令其實在數據庫裡面都轉換成sql語句。

比如：

日志信息如下：

2020-04-29 04:53:23.367 EDT,"postgres","testdb",4655,"[local]",5ea93fed.122f,2,"idle",2020-04-29 04:50:53 EDT,4/9,0,LOG,00000,"statement: SELECT d.datname as ""Name"",

pg_catalog.pg_get_userbyid(d.datdba) as ""Owner"",

pg_catalog.pg_encoding_to_char(d.encoding) as ""Encoding"",

d.datcollate as ""Collate"",

d.datctype as ""Ctype"",

pg_catalog.array_to_string(d.datacl, E'\n') AS ""Access privileges""

FROM pg_catalog.pg_database d

ORDER BY 1;",,,,,,,,"exec_simple_query, postgres.c:1045","psql"

查看主備庫同步的差異情況：

select client_addr,application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) as diff,

sync_state from pg_stat_replication;

client_addr | application_name | diff | sync_state

------------- ------------------ ------ ------------

172.21.0.6 | 172.21.0.6:11000 | 0 | async

查看主備同步狀態：

select

postgres-# current_setting('synchronous_commit') as synchronous_commit,

current_setting('synchronous_standby_names') as synchronous_standby_names,

array((select client_addr||'-'||application_name||'-'||sync_state from pg_stat_replication )) as sync_state;

synchronous_commit | synchronous_standby_names | sync_state

-------------------- --------------------------- ----------------------------------------

on | | {172.21.0.6/32-172.21.0.6:11000-async}

新建一個表空間，然後把create的權限授權給某個用戶，停止不動，原因是其它備庫上沒有創建新表空間的所在的目錄/home/postgres/newtbl。

CUUG PostgreSQL技術大講堂系列公開課第44講-流復制部署，往期視頻及文檔，請聯系CUUG。