How to Find and Analyze Hot Data in PostgreSQL

Imagine you are managing a multi-terabyte database, but only a fraction of that data is actually powering your application’s real-time performance. Without knowing which tables and indexes are “hot”—meaning frequently accessed and ideally stored in memory—you are essentially flying blind when it comes to provisioning RAM or optimizing queries. You might be suffering from slow response times because critical data is being “washed out” of the shared buffers by large, infrequent sequential scans, yet you lack the visibility to pinpoint the exact cause.

Database administrators often face scenarios where adding more memory doesn’t seem to improve performance, or they need to decide which tables to move to faster NVMe storage. Identifying hot data allows you to focus your tuning efforts on the objects that matter most to your users. By leveraging PostgreSQL’s internal metadata and specific extensions, you can gain a granular view of your database’s “Working Set” and ensure that your most important data remains pinned in memory for lightning-fast access.


Key Takeaways for Identifying Hot Data

  • Shared Buffers Cache → This is the primary memory area where Postgres stores data pages to avoid expensive disk I/O.
  • Usage Count → A metric ranging from 0 to 5 that indicates how frequently a specific data page is accessed; a 5 is considered “very hot”.
  • Cache Hit Ratio → A vital health metric (ideally > 90%) that measures how often the engine finds data in RAM versus reading it from the disk.
  • Index Statistics → System views like pg_stat_user_indexes help identify which indexes are being scanned the most.
  • Buffer Context → Using the pg_buffercache extension allows for a real-time “x-ray” of what is currently occupying your RAM.
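As a quick check on the cache hit ratio mentioned above, you can compute it per database from the standard pg_stat_database view. This is a sketch; the >90% threshold is the rule of thumb from the takeaway, and nullif guards against division by zero on a freshly reset stats counter:

SELECT datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
         AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();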

Method 1: Using pg_buffercache for Real-Time Heat Analysis

The most direct way to find hot data is through the pg_buffercache extension. This module provides a view that shows exactly which tables and indexes are in the shared buffers and how “hot” they are based on their access frequency.

Command to Install:

CREATE EXTENSION pg_buffercache;

SQL to View Overall Cache Heat: This query groups your memory usage by the usagecount (0-5), showing how much of your cache is truly “hot.”

SELECT usagecount,
       count(*) AS blocks,
       pg_size_pretty(count(*) * 8192) AS size  -- 8192 = default block size
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount DESC;

Output Example:

 usagecount | blocks |  size
------------+--------+--------
          5 |  12800 | 100 MB
          1 |   2560 | 20 MB
          0 |   1024 | 8 MB

Analysis: If you see a large volume of data with a usagecount of 5, those objects are your most critical hot data.
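To see which relations those usagecount-5 blocks belong to, you can join pg_buffercache against pg_class. This is a sketch assuming the default 8 kB block size; the reldatabase filter restricts the result to the current database:

SELECT c.relname,
       count(*) AS hot_blocks,
       pg_size_pretty(count(*) * 8192) AS hot_size
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.usagecount = 5
  AND b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY hot_blocks DESC
LIMIT 10;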


Method 2: Monitoring Cumulative Statistics (Scans and I/O)

PostgreSQL maintains cumulative counters for every table and index. By comparing high idx_scan counts (from pg_stat_user_tables) against low heap_blks_read values (from pg_statio_user_tables), you can identify data that is frequently accessed and successfully cached.

Command to Find Frequently Scanned Tables:

SELECT relname,
       seq_scan,
       idx_scan,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY idx_scan DESC LIMIT 10;
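The Key Takeaways also mention pg_stat_user_indexes; a companion query (a sketch using that view's standard columns) surfaces the most-scanned indexes:

SELECT relname,
       indexrelname,
       idx_scan,
       idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 10;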

Understanding the Metrics:


  • seq_scan: High numbers here on large tables can “wash out” your cache.
  • idx_scan: High numbers indicate an object is part of the hot data set.
  • blks_hit vs blks_read: In the pg_statio_user_tables view, a high blks_hit indicates the data is staying in RAM.
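Putting the blks_hit vs blks_read comparison into practice, a per-table hit ratio can be computed like this (a sketch; ordering by heap_blks_read surfaces the tables doing the most disk I/O):

SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(100.0 * heap_blks_hit /
             nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;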

Method 3: Inspecting Query Behavior with EXPLAIN (ANALYZE, BUFFERS)

When you suspect a specific query is dealing with hot or cold data, you can use the BUFFERS option of EXPLAIN. This shows exactly how many blocks were found in the cache versus how many had to be read from the disk.

Command Example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customer WHERE id = 101;

Output Details: Look for the “Buffers” line in the output:

Buffers: shared hit=3 read=1

This tells you that 3 blocks were already “hot” in memory, while 1 block was “cold” and required a disk read.


Step-by-Step Process to Identify Your Working Set

  1. Enable Statistics: Ensure track_counts and track_io_timing are enabled in your configuration to get accurate I/O data.
  2. Calculate Global Hit Ratio: Run a query against pg_stat_database to see if your overall shared_buffers are large enough for your workload.
  3. Identify Top Objects: Use the queries from Method 2 to list the top 10 most scanned tables and indexes.
  4. Install pg_buffercache: Execute CREATE EXTENSION pg_buffercache; to enable deep inspection.
  5. Verify Object Heat: Run a join between pg_buffercache and pg_class to see which specific tables have the most blocks with a usagecount of 5.
  6. Optimize: For “hot” tables with low cache hit ratios, consider increasing shared_buffers or creating indexes to prevent cache-clearing sequential scans.

Summary of Hot Data Monitoring Tools

View/Extension        | Key Columns                   | Primary Purpose
----------------------|-------------------------------|---------------------------------------------------
pg_buffercache        | usagecount, relfilenode       | Seeing exactly what is in RAM right now.
pg_stat_user_tables   | idx_scan, seq_scan            | Tracking access frequency over time.
pg_statio_user_tables | heap_blks_hit, heap_blks_read | Measuring cache effectiveness for specific tables.
EXPLAIN ANALYZE       | shared hit, shared read       | Auditing the memory impact of a single query.

FAQs

What does a usagecount of 5 actually mean? In the Postgres buffer management algorithm (a clock-sweep scheme), each time a page is accessed its count increases, up to a maximum of 5. A count of 5 means the page is heavily used and sits at the back of the “eviction line.”

Why is my cache hit ratio high, but the database is still slow? A high hit ratio in pg_stat_database only measures the Postgres cache. You may still be experiencing latency if the OS page cache is struggling or if you have heavy lock contention.

Can I manually “warm up” my hot data after a restart? Yes. You can use the pg_prewarm extension to manually load specific tables or indexes into the shared buffers.
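A minimal pg_prewarm sketch follows; the table name reuses the customer table from the EXPLAIN example above, and the index name customer_pkey is a hypothetical placeholder:

CREATE EXTENSION pg_prewarm;

SELECT pg_prewarm('customer');       -- load the whole table into shared buffers
SELECT pg_prewarm('customer_pkey');  -- load an index (hypothetical name)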


David Cao

David is a Cloud & DevOps enthusiast with years of experience as a Linux engineer, including roles at AMD and EMC. He likes Linux, Python, Bash, and more. He is a technical blogger and software engineer who enjoys sharing what he learns and contributing to open source.
