As your Postgres database grows, you'll likely need a way to generate a smaller, 'good-enough' dataset that preserves the structure and referential integrity of production but is better suited for testing.
There are several ways to do this, but here's a straightforward approach using pg_dump, pg_restore, psql, and GitHub Actions.
Running partial data dumps inside GitHub Actions
You can run pg_dump, pg_restore, and psql from the command line, but sometimes an automated, reproducible approach is more convenient. To better control when data dumps occur, I use a scheduled GitHub Action to export data from my production database and restore it to a testing database. This method works across different Postgres database providers, but if you're looking for a cost-effective testing environment, consider trying Neon. Check out our getting started guide to see how easy it is to set up.
What is a scheduled GitHub Action?
A scheduled GitHub Action runs automatically at a time you define. Since you're dumping data from a production database, you'll likely want to run this job when the system isn't under heavy load, typically outside of business hours. I usually schedule these jobs for midnight, but since I don't want to stay up that late, a scheduled GitHub Action takes care of it while I sleep.
Getting started with GitHub Actions
To create a GitHub Action, you'll need a GitHub repository to store it. If you don't have one yet, create one now and clone it to your machine for local development.
In the root of your project, create a .github directory. Inside it, add another directory called workflows. Then, within workflows, create a new file named dump-test-data.yml, for example:
.github
|-- workflows
    |-- dump-test-data.yml
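If you prefer the terminal, you can create the directory structure and the empty workflow file in one go from the repository root:

# Create the workflows directory and an empty workflow file
mkdir -p .github/workflows
touch .github/workflows/dump-test-data.yml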
Now add the following code.
There's a lot happening here, so before I get to the pg_dump, pg_restore, and psql steps, let me briefly explain what this first part does.
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs daily at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV
name
This name will appear in the Actions section of the GitHub UI. Regardless of what you name your file, this is the name that will be displayed.
on
This section of the workflow determines when the Action will run. The schedule field includes a cron expression, which uses POSIX cron syntax to specify how often the Action should execute.
I've also included the workflow_dispatch field, which lets you manually trigger this Action from the GitHub UI, a useful feature while developing and testing the workflow.
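As a quick aside, if you have the GitHub CLI installed and authenticated against your repository, you can also trigger a workflow_dispatch run from the terminal instead of the UI; a minimal sketch:

# Trigger the workflow manually via the GitHub CLI
gh workflow run dump-test-data.yml

# Check the status of recent runs for this workflow
gh run list --workflow=dump-test-data.yml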
env
This workflow requires two environment variables. The first is the Postgres connection string for the source database, typically your production or staging database. The second is the connection string for the target database, which will serve as your testing database. Both need to use the same version of Postgres. Both of these variables will also need to be added to your GitHub repository's secrets.
To do this, navigate to Settings > Secrets and variables > Actions and add them under Repository secrets.
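If you'd rather not click through the UI, the GitHub CLI can set these secrets too. A quick sketch, assuming the connection strings are exported in your local shell under the same names:

# Store the connection strings as repository secrets (values read from local env vars)
gh secret set PROD_DATABASE_URL --body "$PROD_DATABASE_URL"
gh secret set DEV_DATABASE_URL --body "$DEV_DATABASE_URL"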
The last variable defines the Postgres version to install in the Action environment. Since pg_dump, pg_restore, and psql depend on Postgres, you'll need to install it within the Action; I'll cover this in more detail later. It's also worth noting that the version of Postgres you install here should match the version used by both your source and target databases. In my example, all use Postgres 17.
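If you're not sure which version each database is running, a quick query against both connection strings will tell you. A minimal check, assuming the connection strings are set in your shell:

# Confirm that the source and target databases run the same major Postgres version
psql "$PROD_DATABASE_URL" -c "SHOW server_version;"
psql "$DEV_DATABASE_URL" -c "SHOW server_version;"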
jobs/steps
The job is named dump-and-restore, which will be displayed in the GitHub UI when the Action is running. You can choose any name you prefer.
The first step in the job is to install Postgres. While there are various methods and alternative options available in the GitHub Marketplace, I prefer to install directly from the official PostgreSQL Apt (Advanced Packaging Tool) repository for added security, especially since this workflow has direct access to your production database.
The next step is to define a variable that is needed when using pg_dump, pg_restore, and psql. This variable is named POSTGRES and will be referenced later as $POSTGRES/pg_dump.
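To sanity-check that the path points at the version you expect, you could run the pinned binaries directly (on the runner or a matching local install); a quick sketch:

# Verify the pinned binaries exist and report the expected version
/usr/lib/postgresql/17/bin/pg_dump --version
/usr/lib/postgresql/17/bin/psql --version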
Before I jump into the dump/restore parts, I'll quickly explain the schema I've used in this example. It’s important to note the foreign key relationships between the tables.
In my example, the foreign key relationships are as follows:
- The transactions table has a foreign key user_id that references the user_id column in the users table. This establishes a relationship where each transaction is linked to a specific user.
- The transactions table is linked to the products table through the product_id foreign key. This establishes a relationship where each transaction is associated with a specific product.
users
This is the schema used to create the users table.
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(150) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
products
This is the schema used to create the products table.
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description TEXT,
price DECIMAL(10, 2) NOT NULL,
stock_quantity INT DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
transactions
This is the schema used to create the transactions table.
CREATE TABLE transactions (
transaction_id SERIAL PRIMARY KEY,
user_id INT REFERENCES users(user_id) ON DELETE CASCADE,
product_id INT REFERENCES products(product_id) ON DELETE CASCADE,
quantity INT NOT NULL,
total_amount DECIMAL(10, 2) NOT NULL,
transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(50) NOT NULL CHECK (status IN ('pending', 'completed', 'failed'))
);
The transactions table in my example relies on data from both the users and products tables. When performing a partial data dump, it's important that every user_id and product_id referenced by the dumped transaction rows also exists in the users and products data you dump; otherwise, the foreign key constraints will fail when the data is restored.
With this in mind, I'll start with the transactions table when deciding which data to include in the partial dump.
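If you want to confirm the foreign key relationships on your own tables before settling on a dump order, psql's \d meta-command lists them; for example, against the source database:

# List columns, constraints, and foreign keys for the transactions table
psql "$PROD_DATABASE_URL" -c "\d transactions"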
Dump and restore partial data
Add the following code after the Set PostgreSQL binary path step.
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs daily at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV

      - name: Dump schema
        run: |
          $POSTGRES/pg_dump -Fc --schema-only -f "${{ github.workspace }}/all-schema.bak" "${{ env.PROD_DATABASE_URL }}"

      - name: Dump data
        run: |
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50) TO '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"

      - name: Drop tables and schema
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "DROP SCHEMA IF EXISTS public CASCADE;"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "CREATE SCHEMA public;"

      - name: Restore schema
        run: |
          $POSTGRES/pg_restore --clean --no-owner --no-acl --if-exists --schema-only -d "${{ env.DEV_DATABASE_URL }}" "${{ github.workspace }}/all-schema.bak"

      - name: Restore data
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.users FROM '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.products FROM '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.transactions FROM '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
The above code snippet might look a bit complicated at first, but it’s actually not that bad—let me break it down for you.
Dump schema
In this step, I use pg_dump to export the entire schema from the production database and save it to the GitHub workspace as a file named all-schema.bak. The file lives on the runner's filesystem for the duration of the job, so it can be accessed later by the Restore schema step towards the end of the job.
The flags used in this step are explained below:
| Flag | Meaning |
| --- | --- |
| -Fc | Dumps the database in a custom format. |
| --schema-only | Dumps only the schema (table structures, indexes, constraints) without any data. |
| -f | Specifies the output file where the schema dump will be stored. |
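If you want to try the schema dump locally before wiring it into the Action, the equivalent command looks like this (assuming pg_dump 17 is on your PATH and PROD_DATABASE_URL is set in your shell):

# Dump only the schema, in custom format, to a local file
pg_dump -Fc --schema-only -f all-schema.bak "$PROD_DATABASE_URL"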
Dump data
In this step, I use psql to query the data. This is the most complex step, involving three SQL queries, each targeting one of the three tables. The queries are as follows:
Transactions query
This query selects the 50 most recent transactions from the transactions table. Depending on your requirements, you can increase the LIMIT or modify the query.
SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50
The results are saved to the GitHub workspace as a file called transactions-subset.csv, which will be used in a later step.
Products query
This query selects products, but only those with a product_id present in the 50 most recent transactions:
SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
The results are saved to the GitHub workspace as a file called products-subset.csv, which will be used in a later step.
Users query
This query selects users, but only those with a user_id present in the 50 most recent transactions:
SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
The results are saved to the GitHub workspace as a file called users-subset.csv, which will be used in a later step.
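Before relying on the CSVs, it doesn't hurt to eyeball them. If you run the same \copy commands on your own machine, a quick sanity check is to count the rows in each file:

# Each file should contain a header row plus the exported rows
wc -l transactions-subset.csv products-subset.csv users-subset.csv

# Spot-check the header and first couple of rows
head -n 3 transactions-subset.csv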
Drop tables and schema
In this step, I use psql to drop the schema and create a fresh one. Since this Action runs on a schedule, this cleanup ensures the target database is ready for new schema and data, avoiding any errors from schema changes since the last run.
The statements used in this step are explained below:
| Statement | Effect |
| --- | --- |
| DROP SCHEMA IF EXISTS public CASCADE; | Removes the public schema and everything inside it. |
| CREATE SCHEMA public; | Recreates the public schema for a clean restore process. |
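To confirm the cleanup behaves as expected, you can list the tables in the target database; immediately after this step the public schema should come back empty:

# List tables in the public schema of the target database (should be empty after the drop)
psql "$DEV_DATABASE_URL" -c "\dt public.*"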
Restore schema
In this step, I use pg_restore to restore the schema from the all-schema.bak file, which was saved to the GitHub workspace during the Dump schema step.
The flags used in this step are explained below:
| Flag | Meaning |
| --- | --- |
| --clean | Drops existing database objects before recreating them, ensuring a clean restore. |
| --no-owner | Ignores ownership information in the dump file, so restored objects are owned by the user running the restore. |
| --no-acl | Excludes access control (GRANT/REVOKE) statements from the restore, preventing permission changes. |
| --if-exists | Ensures that DROP commands (used with --clean) only execute if the object exists, preventing errors. |
| --schema-only | Restores only the schema (table structures, indexes, constraints) without inserting any data. |
| -d | Specifies the target database to restore into. |
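The same restore can be run locally against a scratch database if you want to test the dump file outside the Action (assuming pg_restore 17 is on your PATH and DEV_DATABASE_URL is set):

# Restore only the schema from the custom-format dump into the target database
pg_restore --clean --if-exists --no-owner --no-acl --schema-only -d "$DEV_DATABASE_URL" all-schema.bak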
Restore data
In this step, I use psql to restore the data to the target database from the .csv files generated in the Dump data step. The files are loaded in dependency order, users and products first, then transactions, so the foreign key constraints on the transactions table are satisfied.
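Once the restore finishes, a couple of quick queries against the target database will confirm the subset arrived intact and that referential integrity held; a sketch:

# Row count for the restored transactions subset (should match the LIMIT, up to 50)
psql "$DEV_DATABASE_URL" -c "SELECT count(*) FROM transactions;"

# Orphaned transactions with no matching user (should return 0)
psql "$DEV_DATABASE_URL" -c "SELECT count(*) FROM transactions t LEFT JOIN users u ON t.user_id = u.user_id WHERE u.user_id IS NULL;"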
Finished
Once the Action completes successfully, your target database will have a fresh test data set ready for use!
This Action is part of our Dev/Test use case, widely used by Neon customers who face limitations with traditional databases for testing. By leveraging a dedicated Neon database, while leaving production environments where they are, developers gain access to Neon's full suite of features, including the built-in SQL editor, table explorer, and branching.
If you'd like to learn more about using Neon for testing, check out our docs or contact our sales team.