As your Postgres database grows, you'll likely need a way to generate a smaller, 'good-enough' dataset that preserves the structure and referential integrity of production but is better suited for testing.
There are several ways to do this, but here's a straightforward approach using pg_dump, pg_restore, psql, and GitHub Actions.
Running partial data dumps inside GitHub Actions
You can run pg_dump, pg_restore, and psql from the command line, but sometimes an automated, reproducible approach is more convenient. To better control when data dumps occur, I use a scheduled GitHub Action to export data from my production database and restore it to a testing database. This method works across different Postgres database providers, but if you're looking for a cost-effective testing environment, consider trying Neon. Check out our getting started guide to see how easy it is to set up.
What is a scheduled GitHub Action?
A scheduled GitHub Action runs automatically at a time you define. Since you're dumping data from a production database, you'll likely want to run this job when the system isn't under heavy load, typically outside of business hours. I usually schedule these jobs for midnight, but since I don't want to stay up that late, a scheduled GitHub Action takes care of it while I sleep.
Getting started with GitHub Actions
To create a GitHub Action, you'll need a GitHub repository to store it. If you don't have one yet, create one now and clone it to your machine for local development.
In the root of your project, create a .github directory. Inside it, add another directory called workflows. Then, within workflows, create a new file named dump-test-data.yml, for example:
.github
|-- workflows
    |-- dump-test-data.yml
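If you prefer the terminal, you can create the directory structure and the empty workflow file in one go from the repository root:

# Create the workflows directory and an empty workflow file
mkdir -p .github/workflows
touch .github/workflows/dump-test-data.yml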
Now add the following code.
There's a lot happening here, so before I get to the pg_dump, pg_restore, and psql steps, let me briefly explain what this first part does.
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs daily at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV
name
This name will appear in the Actions section of the GitHub UI. Regardless of what you name your file, this is the name that will be displayed.
on
This section of the workflow determines when the Action will run. The schedule field includes a cron expression, which uses POSIX cron syntax to specify how often the Action should execute.
I've also included the workflow_dispatch field, which lets you manually trigger this Action from the GitHub UI, a useful feature while developing and testing the workflow.
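As a quick aside, if you have the GitHub CLI installed and authenticated against your repository, you can also trigger a workflow_dispatch run from the terminal instead of the UI; a minimal sketch:

# Trigger the workflow manually via the GitHub CLI
gh workflow run dump-test-data.yml

# Check the status of recent runs for this workflow
gh run list --workflow=dump-test-data.yml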
env
This workflow requires two environment variables. The first is the Postgres connection string for the source database, typically your production or staging database. The second is the connection string for the target database, which will serve as your testing database. Both need to use the same version of Postgres. Both of these variables will also need to be added to your GitHub repository's secrets.
To do this, navigate to Settings > Secrets and variables > Actions and add them under Repository secrets.
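If you'd rather not click through the UI, the GitHub CLI can set these secrets too. A quick sketch, assuming the connection strings are exported in your local shell under the same names:

# Store the connection strings as repository secrets (values read from local env vars)
gh secret set PROD_DATABASE_URL --body "$PROD_DATABASE_URL"
gh secret set DEV_DATABASE_URL --body "$DEV_DATABASE_URL"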
The last variable defines the Postgres version to install in the Action environment. Since pg_dump, pg_restore, and psql depend on Postgres, you'll need to install it within the Action; I'll cover this in more detail later. It's also worth noting that the version of Postgres you install here should match the version used by both your source and target databases. In my example, all use Postgres 17.
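If you're not sure which version each database is running, a quick query against both connection strings will tell you. A minimal check, assuming the connection strings are set in your shell:

# Confirm that the source and target databases run the same major Postgres version
psql "$PROD_DATABASE_URL" -c "SHOW server_version;"
psql "$DEV_DATABASE_URL" -c "SHOW server_version;"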
jobs/steps
The job is named dump-and-restore, which will be displayed in the GitHub UI when the Action is running. You can choose any name you prefer.
The first step in the job is to install Postgres. While there are various methods and alternative options available in the GitHub Marketplace, I prefer to install directly from the official PostgreSQL Apt (Advanced Packaging Tool) repository for added security, especially since this workflow has direct access to your production database.
The next step is to define a variable that is needed when using pg_dump, pg_restore, and psql. This variable is named POSTGRES and will be referenced later as $POSTGRES/pg_dump.
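To sanity-check that the path points at the version you expect, you could run the pinned binaries directly (on the runner or a matching local install); a quick sketch:

# Verify the pinned binaries exist and report the expected version
/usr/lib/postgresql/17/bin/pg_dump --version
/usr/lib/postgresql/17/bin/psql --version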
Before I jump into the dump/restore parts, I'll quickly explain the schema I've used in this example. It’s important to note the foreign key relationships between the tables.
In my example, the foreign key relationships are as follows:
- The transactions table has a foreign key user_id that references the user_id column in the users table. This establishes a relationship where each transaction is linked to a specific user.
- The transactions table is linked to the products table through the product_id foreign key. This establishes a relationship where each transaction is associated with a specific product.
users
This is the schema used to create the users table.
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(150) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
products
This is the schema used to create the products table.
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description TEXT,
price DECIMAL(10, 2) NOT NULL,
stock_quantity INT DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
transactions
This is the schema used to create the transactions table.
CREATE TABLE transactions (
transaction_id SERIAL PRIMARY KEY,
user_id INT REFERENCES users(user_id) ON DELETE CASCADE,
product_id INT REFERENCES products(product_id) ON DELETE CASCADE,
quantity INT NOT NULL,
total_amount DECIMAL(10, 2) NOT NULL,
transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(50) NOT NULL CHECK (status IN ('pending', 'completed', 'failed'))
);
The transactions table in my example relies on data from both the users and products tables. When performing a partial data dump, it's important that every user_id and product_id referenced by the dumped transaction rows also exists in the users and products data you dump; otherwise, the foreign key constraints will fail when the data is restored.
With this in mind, I'll start with the transactions table when deciding which data to include in the partial dump.
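If you want to confirm the foreign key relationships on your own tables before settling on a dump order, psql's \d meta-command lists them; for example, against the source database:

# List columns, constraints, and foreign keys for the transactions table
psql "$PROD_DATABASE_URL" -c "\d transactions"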
Dump and restore partial data
Add the following code after the Set PostgreSQL binary path step.
name: Dump Test Data

on:
  schedule:
    - cron: '0 0 * * *' # Runs daily at midnight UTC
  workflow_dispatch:

env:
  PROD_DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }} # Production or staging database
  DEV_DATABASE_URL: ${{ secrets.DEV_DATABASE_URL }} # Development or testing database
  PG_VERSION: '17'

jobs:
  dump-and-restore:
    runs-on: ubuntu-latest
    steps:
      - name: Install PostgreSQL
        run: |
          sudo apt update
          yes '' | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
          sudo apt install -y postgresql-${{ env.PG_VERSION }}

      - name: Set PostgreSQL binary path
        run: echo "POSTGRES=/usr/lib/postgresql/${{ env.PG_VERSION }}/bin" >> $GITHUB_ENV

      - name: Dump schema
        run: |
          $POSTGRES/pg_dump -Fc --schema-only -f "${{ github.workspace }}/all-schema.bak" "${{ env.PROD_DATABASE_URL }}"

      - name: Dump data
        run: |
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50) TO '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.PROD_DATABASE_URL }}" -c "\copy (SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)) TO '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"

      - name: Drop tables and schema
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "DROP SCHEMA IF EXISTS public CASCADE;"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "CREATE SCHEMA public;"

      - name: Restore schema
        run: |
          $POSTGRES/pg_restore --clean --no-owner --no-acl --if-exists --schema-only -d "${{ env.DEV_DATABASE_URL }}" "${{ github.workspace }}/all-schema.bak"

      - name: Restore data
        run: |
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.users FROM '${{ github.workspace }}/users-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.products FROM '${{ github.workspace }}/products-subset.csv' WITH CSV HEADER"
          $POSTGRES/psql "${{ env.DEV_DATABASE_URL }}" -c "\copy public.transactions FROM '${{ github.workspace }}/transactions-subset.csv' WITH CSV HEADER"
The above code snippet might look a bit complicated at first, but it’s actually not that bad—let me break it down for you.
Dump schema
In this step, I use pg_dump to export the entire schema from the production database and save it to the GitHub workspace as a file named all-schema.bak. The file lives on the runner's filesystem for the duration of the job, so it can be accessed later by the Restore schema step towards the end of the job.
The flags used in this step are explained below:
| Flag | Meaning |
| --- | --- |
| -Fc | Dumps the database in a custom format. |
| --schema-only | Dumps only the schema (table structures, indexes, constraints) without any data. |
| -f | Specifies the output file where the schema dump will be stored. |
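If you want to try the schema dump locally before wiring it into the Action, the equivalent command looks like this (assuming pg_dump 17 is on your PATH and PROD_DATABASE_URL is set in your shell):

# Dump only the schema, in custom format, to a local file
pg_dump -Fc --schema-only -f all-schema.bak "$PROD_DATABASE_URL"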
Dump data
In this step, I use psql to query the data. This is the most complex step, involving three SQL queries, each targeting one of the three tables. The queries are as follows:
Transactions query
This query selects the 50 most recent transactions from the transactions table. Depending on your requirements, you can increase the LIMIT or modify the query.
SELECT * FROM transactions ORDER BY transaction_id DESC LIMIT 50
The results are saved to the GitHub workspace as a file called transactions-subset.csv, which will be used in a later step.
Products query
This query selects products, but only those with a product_id present in the 50 most recent transactions:
SELECT * FROM products WHERE product_id IN (SELECT product_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
The results are saved to the GitHub workspace as a file called products-subset.csv, which will be used in a later step.
Users query
This query selects users, but only those with a user_id present in the 50 most recent transactions:
SELECT * FROM users WHERE user_id IN (SELECT user_id FROM transactions ORDER BY transaction_id DESC LIMIT 50)
The results are saved to the GitHub workspace as a file called users-subset.csv, which will be used in a later step.
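Before relying on the CSVs, it doesn't hurt to eyeball them. If you run the same \copy commands on your own machine, a quick sanity check is to count the rows in each file:

# Each file should contain a header row plus the exported rows
wc -l transactions-subset.csv products-subset.csv users-subset.csv

# Spot-check the header and first couple of rows
head -n 3 transactions-subset.csv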
Drop tables and schema
In this step, I use psql to drop the schema and create a fresh one. Since this Action runs on a schedule, this cleanup ensures the target database is ready for new schema and data, avoiding any errors from schema changes since the last run.
The statements used in this step are explained below:
| Statement | Effect |
| --- | --- |
| DROP SCHEMA IF EXISTS public CASCADE; | Removes the public schema and everything inside it. |
| CREATE SCHEMA public; | Recreates the public schema for a clean restore process. |
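To confirm the cleanup behaves as expected, you can list the tables in the target database; immediately after this step the public schema should come back empty:

# List tables in the public schema of the target database (should be empty after the drop)
psql "$DEV_DATABASE_URL" -c "\dt public.*"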
Restore schema
In this step, I use pg_restore to restore the schema from the all-schema.bak file, which was saved to the GitHub workspace during the Dump schema step.
The flags used in this step are explained below:
| Flag | Meaning |
| --- | --- |
| --clean | Drops existing database objects before recreating them, ensuring a clean restore. |
| --no-owner | Ignores ownership information in the dump file, so restored objects are owned by the user running the restore. |
| --no-acl | Excludes access control (GRANT/REVOKE) statements from the restore, preventing permission changes. |
| --if-exists | Ensures that DROP commands (used with --clean) only execute if the object exists, preventing errors. |
| --schema-only | Restores only the schema (table structures, indexes, constraints) without inserting any data. |
| -d | Specifies the target database to restore into. |
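The same restore can be run locally against a scratch database if you want to test the dump file outside the Action (assuming pg_restore 17 is on your PATH and DEV_DATABASE_URL is set):

# Restore only the schema from the custom-format dump into the target database
pg_restore --clean --if-exists --no-owner --no-acl --schema-only -d "$DEV_DATABASE_URL" all-schema.bak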
Restore data
In this step, I use psql to restore the data to the target database from the .csv files generated in the Dump data step. The files are loaded in dependency order, users and products first, then transactions, so the foreign key constraints on the transactions table are satisfied.
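Once the restore finishes, a couple of quick queries against the target database will confirm the subset arrived intact and that referential integrity held; a sketch:

# Row count for the restored transactions subset (should match the LIMIT, up to 50)
psql "$DEV_DATABASE_URL" -c "SELECT count(*) FROM transactions;"

# Orphaned transactions with no matching user (should return 0)
psql "$DEV_DATABASE_URL" -c "SELECT count(*) FROM transactions t LEFT JOIN users u ON t.user_id = u.user_id WHERE u.user_id IS NULL;"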
Finished
Once the Action completes successfully, your target database will have a fresh test data set ready for use!
This Action is part of our Dev/Test use case, widely used by Neon customers who face limitations with traditional databases for testing. By leveraging a dedicated Neon database, while leaving production environments where they are, developers gain access to Neon's full suite of features, including the built-in SQL editor, table explorer, and branching.
If you'd like to learn more about using Neon for testing, check out our docs or contact our sales team.