PostgreSQL/Local Cache

From Jonathan Gardner's Tech Wiki
Jump to: navigation, search

Overview

If you want to use PostgreSQL as a local cache you can run arbitrarily complicated queries against, here's how.

Tools

For this, you'll need to really understand the nuts and bolts of PostgreSQL. I'll walk you through them. The most important tools are:

  • tmpfs: A way to create files in memory that never leave memory (except to swap out when not in use.)
  • initdb: A way to create new installations for PostgreSQL. Simply run "initdb data_directory" and it will do the right thing.
  • postgres: The actual server process. Start up the server with "postgres -d data_directory"
  • PostgreSQL/Configuration: How to configure PostgreSQL.
  • Your OS's virtual memory management. If your OS is doing things right, it will keep the pages for the files in memory. (If not, you can force it with something like tmpfs.) As usual, PostgreSQL relies on proper caching at the OS level, so leave plenty of memory for that.

Best Recommendation

Right now, the best speed I can get is the following configuration:

  • tmpfs for everything BUT the log files.
  • fsync=off
  • synchronous_commit=off

If you don't want to use tmpfs, then just using fsync=off and synchronous_commit=off will get you extremely close to the same performance. YMMV.

full_page_writes=off didn't seem to help, but rather hurt performance. Of course, my testing was limited.

Startup Procedure

To startup a fresh cache:

  • Clear out the old data. Just throw it away. There's a good chance it got corrupted anyway.
  • Use initdb to create your installation.
  • Replace the configuration file or point postgres to a configuration file that points to your data directory.
  • Startup the server with "postgres". If you don't want to manage the process, use "pg_ctl".
  • Create the tables for your cache.
  • Populate and query and refresh, ad nauseum.
  • The moment things go bad, rebuild from scratch.

Shutdown

Simply kill the "postgres" process, preferrably without -9. You can use pg_ctl to do this.

Nice things

  • You can login and play around with live data.
  • You get very high concurrency, even with DML, for free.
  • PostgreSQL is extremely fast in this mode, challenging other so-called fast databases while providing more useful features for free.

Look out for these things

  • If you don't have proper indexing, you won't use them.
  • You may want to run vacuum analyze when you first get your cache populated, rather than waiting for autovacuum to do so. Or, you may want to make autovacuum very sensitive.
  • Try doing larger commits. PostgreSQL doesn't suffer from the problems other databases have, provided the larger commits won't trip over each other. Note, however, that moving the commit to later didn't give me a noticeable improvement in performance. However, when I have fsync turned on, then it did give me massive performance improvements.