Hooray! I finished reading the “Learn PostgreSQL” book – a 700-page brick 😉 It took over a month of waking up at 6 am to read, make notes in my notebooks and write summaries on this blog.
I was exposed to many new concepts and features of Postgres that I was not familiar with before. Now I am looking forward to leveraging the full power of Postgres in my GADM project and at my workplace. In particular, I want to experiment with partitioning, query tuning, backups and replication.
I would give this book 6.5 out of 10 points. I like the selection of topics and the fact that this book is both practical and theoretical. Nevertheless, I would prefer it if it had more deep dives and fewer step-by-step instructions. The book feels verbose in some places. I feel it could have been shrunk to 500 pages and still cover all the topics.
The last chapter of the book mentions a few useful Postgres extensions:
– pg_trgm – a tool for indexing text fields. Native Postgres indexes can only support LIKE 'something%' queries, where the search phrase starts with an exact value. pg_trgm can index LIKE '%something%' queries by splitting values into trigrams and building GiST or GIN indexes on them (see the sketch after this list).
– pgBackRest – a backup automation tool that can store backups in cloud buckets. It supports PITR, backup encryption and much more.
– postgres_fdw – an extension for handling foreign data wrappers. It enables connections to external data sources such as other Postgres instances (similar FDW extensions exist for MySQL or NoSQL databases). Once a connection is established, you can query remote tables as if they were local.
– pgloader – allows migrating data from other databases into Postgres.
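Here is a minimal sketch of the pg_trgm approach, assuming a hypothetical articles table with a text column title:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- trigram GIN index on the hypothetical articles(title) column
CREATE INDEX articles_title_trgm_idx ON articles USING gin (title gin_trgm_ops);
-- a leading-wildcard search can now use the index instead of a sequential scan
SELECT * FROM articles WHERE title LIKE '%postgres%';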
I finished the logical replication chapter and started the last chapter of the book – useful extensions.
Postgres provides replication monitoring utilities – the same pg_stat_replication view that monitors physical replication. You can also get more information about the publishing and subscribing processes by querying the pg_publication and pg_subscription catalogs on the primary and replica nodes respectively.
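A quick sketch of these monitoring queries (column names assume Postgres 10+):
-- on the publisher: streaming status of each replication connection
SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;
-- on the publisher: defined publications
SELECT * FROM pg_publication;
-- on the subscriber: defined subscriptions
SELECT * FROM pg_subscription;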
Logical replication doesn’t block writes on the replica node. However, it seems to be bad practice to write to the replica, since in case of collisions on unique fields the pub/sub system will fail and stop streaming new messages to subscribers until a DBA resolves the issue.
Synchronization errors are then fixed by dropping the subscription, truncating the tables with discrepancies and re-creating the subscription.
Moreover, logical replication doesn’t replicate DDL changes, so the DBA is responsible for keeping database schemas in sync. Otherwise the pub/sub will again fail and require a manual fix.
Postgres logical replication relies on a pub/sub mechanism. The primary node decodes WAL segments and derives SQL-level changes from them, then publishes these commands, while the replica node subscribes to the publication and passes the received changes directly to its query executor.
Logical replication:
– is a bit slower than physical replication
– allows replication across Postgres versions
– allows selective data replication e.g. only certain databases or tables
– copies only Data Manipulation Language (DML) changes, so the DBA has to ensure that the replica and primary node have synchronized schemas and settings.
– allows a small, clean copy, e.g. for testing purposes
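A minimal sketch of the pub/sub setup described above, assuming a hypothetical orders table and placeholder connection values:
-- on the primary (publisher)
CREATE PUBLICATION my_pub FOR TABLE orders;
-- on the replica (subscriber); the connection string values are placeholders
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=primary.example.com dbname=mydb user=repuser password=secret'
  PUBLICATION my_pub;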
Physical replication:
– fast
– replicates the entire cluster at the page level; this means that unlogged tables are not replicated.
– replicates both Data Definition Language (DDL) and Data Manipulation Language (DML)
– works only between identical major versions of Postgres
– it is an exact copy of the entire cluster
The remaining part of the chapter about physical replication consisted mainly of a walk-through of the replication setup steps. It is recommended to replicate clusters on machines with the same hardware parameters – this way replicas can reuse the configuration from the primary node without the risk of running out of resources.
Having multiple replicas can put too much strain on the primary node, therefore a cascading setup is recommended: only one replica streams from the primary node, while the other replicas stream from each other.
In case the primary node goes down, we need to promote one of the replicas. This cannot be automated with native Postgres tools – human intervention is required. Use the pg_ctl promote -D /path/to/PostgreSQL/<version>/main command to promote a replica to a primary node. Keep in mind that after the promotion the old primary node becomes unrecoverable and you will likely suffer some data loss.
Physical replication relies on the concept of streaming WAL segments from the primary node to a replica node. The replica is in a state of continuous recovery – it continuously replays WAL segments received from the primary node.
In synchronous replication the primary node ensures the replica has received the WAL segments before a transaction completes, while in asynchronous mode WAL segments are streamed to the replica without waiting for a return code.
The primary node has to store more WAL segments when replication is enabled, in case the replica node is temporarily down. When the replica is back up, the primary node needs to send all “missed” WAL segments, otherwise the replica won’t be able to recover. The amount of “extra” WAL segments stored by the primary node can be defined with the wal_keep_segments setting. Since a single WAL segment is 16MB, setting wal_keep_segments to 100 means that the primary server stores 1.6GB of extra WAL segments at any given time. This problem can be mitigated by a feature called replication slots, which ensure that WAL segments are kept on the primary node only until the replica has received them. You can create and remove a slot using the pg_create_physical_replication_slot('master') and pg_drop_replication_slot('master') functions.
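A short sketch of working with a physical replication slot (the slot name 'master' follows the book's example):
-- create a physical replication slot on the primary node
SELECT pg_create_physical_replication_slot('master');
-- inspect existing slots and whether a replica is consuming them
SELECT slot_name, slot_type, active FROM pg_replication_slots;
-- drop the slot once it is no longer needed (otherwise WAL keeps accumulating)
SELECT pg_drop_replication_slot('master');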
I was mainly writing down notes about Postgres configuration. I have started reading about physical replication but I will include the notes about this chapter later.
I completed the configuration & monitoring chapter. Postgres allows configuration modifications while the cluster runs by means of the ALTER SYSTEM command. Modifications applied this way are appended to the postgresql.auto.conf file.
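For example (the parameter and value are only illustrative):
-- persisted to postgresql.auto.conf by the running cluster
ALTER SYSTEM SET work_mem = '64MB';
-- ask the cluster to re-read its configuration (sufficient for sighup/user contexts)
SELECT pg_reload_conf();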
You can use configuration generators like PgConfig, which offers a simple GUI.
Information about running queries, sessions and backend processes can be read from the pg_stat_activity catalog. Information about locks lives in the pg_locks catalog, while pg_stat_database includes data about committed and rolled-back transactions. You can check table and index usage in the pg_stat_user_tables and pg_stat_user_indexes catalogs.
All built-in statistics catalogs show real-time data. For persisting historical statistics you need to install the pg_stat_statements extension.
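Two illustrative queries against the real-time catalogs mentioned above (a sketch, not taken from the book):
-- currently active sessions and how long their queries have been running
SELECT pid, state, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active';
-- indexes that have never been used by a scan
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;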
[Book “Learn PostgreSQL”] Postgres stores configuration across several files – postgresql.conf, postgresql.auto.conf and pg_hba.conf. You can extend the settings with additional configuration files, which allows organizing settings by topic – use the include, include_dir or include_if_exists directives.
You can inspect settings using the SHOW command or by querying the pg_settings catalog.
Each configuration belongs to a context which defines when a change to a parameter is applied – internal, postmaster, sighup, superuser-backend, backend, user and superuser.
Some interesting parameters that can be set are wal_level, shared_buffers, work_mem and wal_buffers. You can also control the process title and cluster name with update_process_title and cluster_name. Network settings specify addresses, ports and more – listen_addresses, port, max_connections.
[Book “Learn PostgreSQL”] I completed the chapter about physical backups. They are performed by copying PGDATA, tablespaces and WALs to another machine or a different directory. You can leverage the built-in Postgres command pg_basebackup or the 3rd party tool pgBackRest. Physical backups require a careful approach to configuration, e.g. connection values and hba rules.
Point in time recovery is achieved by streaming WAL segments to another ‘archive’ location, e.g. another computer or a bucket. PITR allows the cluster to be ‘replayed’ up until a selected timestamp.
Postgres is equipped with three tools designated to handle logical backups: pg_dump, pg_dumpall and pg_restore. They offer rich customization for text and binary backups that can copy selected database objects. Dumping and restoring a database can be performed by parallel workers to decrease operation time. The COPY SQL command and the psql \copy command allow integrating the dump and restore process with the system’s command-line tools.
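A minimal sketch of the COPY variant, assuming a hypothetical users table (the file path is server-side; the psql \copy command works analogously but reads and writes files on the client):
-- export a single table to CSV
COPY users TO '/tmp/users.csv' WITH (FORMAT csv, HEADER);
-- load it back into an (empty) table with the same structure
COPY users FROM '/tmp/users.csv' WITH (FORMAT csv, HEADER);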
Backup & Restoration – Postgres supports two types of backups – logical and physical.
A logical backup copies table by table, as if a client were performing a series of SELECT statements. It is a reliable and simple method that works across major versions. However, it takes more time and puts additional CPU and memory strain on the cluster (it requires a transaction to persist over the duration of the backup). It only copies the data that exists at the time the backup starts – thus the backed-up data can already be outdated once it completes.
A physical backup is less invasive – it copies data from files (PGDATA/base and WALs) and allows point-in-time recovery. However, it is more complex to perform and recover from. It works only within the same major version. It can cause a spike in I/O bandwidth.
I completed the chapter about logging and auditing. Main takeaways are:
– Postgres logs information about its actions. The quantity, frequency, destination and content of these logs are highly configurable.
– There are two types of logging: syslog, for logging to third-party systems (this method can drop logs under heavy traffic), and the logging collector, the local logging mechanism that guarantees all logs are saved.
– You can use PgBadger for automated scheduled report creation
– PgAudit is a tool for more refined logging that is session or user aware
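A hedged sketch of turning the configuration knobs mentioned above (parameter values are only illustrative):
-- enable the local logging collector; postmaster context, so a restart is required
ALTER SYSTEM SET logging_collector = on;
-- log every statement slower than 500 milliseconds, a common input for PgBadger reports
ALTER SYSTEM SET log_min_duration_statement = 500;
SELECT pg_reload_conf();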
I finished the query tuning chapter. Today I read about query optimization and auto_explain. Main takeaways are:
– Usually query optimization boils down to leveraging an appropriate index.
– Avoid subqueries since they will likely cause redundant looping.
– Keep an eye on unused indexes (pg_stat_user_indexes) since they take up space and add overhead when inserting new rows.
– Use auto_explain to automate logging of inefficient queries.
– EXPLAIN uses statistics to find the best query plan, while EXPLAIN ANALYZE actually executes the query and returns the execution time. Wrap it in a transaction and roll back in order to avoid side effects.
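A minimal sketch of this roll-back trick, assuming a hypothetical orders table:
BEGIN;
-- EXPLAIN ANALYZE really executes the statement, so wrap it in a transaction
EXPLAIN ANALYZE UPDATE orders SET status = 'shipped' WHERE id = 42;
ROLLBACK;  -- discards the side effects of the analyzed statement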
I completed the section about indexes and started reading about the EXPLAIN and ANALYZE statements. Several recent chapters were theory heavy and required a lot of note taking – this slowed down my progress. I will probably skip note taking in the next chapters – Audit Logging and Configuration … or at least restrict my notes to the crucial parts.
Today I finished the part about parallel workers. I could not spend the usual amount of time on the book because of a trip to the office. I hope to cover more tomorrow when I start the section about indexes.