Post image

As beautiful as a greenfield database is, it’s useless without data. This is not just for production–tests, development, staging, and demos–all need data for proper functionality.

This is where seed files come in. Seed files are scripts that populate your database with initial data: admin roles, initial categories, and basic users. This seems like a simple concept. But as with everything in development, what seems simple at the outset becomes complex as you scale.

Seeding needs management just like anything else. So, how do you do that?

How to Maintain Database Seed Files

From Reddit:

The seeds are split up into multiple files based on the specific environment you’re running in.

The actual seed data is stored in YAML-based files which look and feel similar to test fixtures.

seeds.rb checks the environment, loads the environment-specific seed file, which grabs all the YAML files in the environment-specific data directory and then passes each one to a special-made class that’s only job is to parse the YAML and create the objects in the database.

I don’t think you’d need to go quite this far. The way we do things is specific to us and reuses a lot of code that’s also used to onboard new customers.

As far as making sure they’re kept up to date, that’s something you’d handle during code review. If there’s a database change without a corresponding change to the seed data, you call it out during review and block merges until it’s fixed.

Managing seed files doesn’t have to be complex, but it does require intentional organization. As our Reddit user demonstrates, you can create sophisticated systems with environment-specific files and YAML parsing. While you might not need that level of complexity, this core approach–separating environments and automating the process–is spot on.

Version your seed files

Your seed files should be treated as first-class citizens in your codebase. Like any other code, they must be versioned, reviewed, and maintained. When developers make schema changes, the corresponding seed modifications should be part of the same pull request. Without this, tests and environments will break downstream.

Something as simple as this can work:

## Database Changes
- [ ] Schema changes made? If yes:
  - [ ] Corresponding seed files updated
  - [ ] Seed file changes tested in clean environment
  - [ ] Idempotency verified

Automate execution

Having seeds as part of your codebase also makes automation easier. Manual seeding is tedious and error-prone. Build seed execution into your deployment pipeline. For development environments, consider using Docker Compose or similar tools to automate initialization:

version: '3'
services:
  db:
    image: postgres:latest
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
  app:
    build: .
    depends_on:
      - db
    command: sh -c "bundle exec rails db:setup && bundle exec rails db:seed"

Structure by environment

Different environments have distinct data requirements. Development environments might need comprehensive dummy data for UI testing, while test environments typically need minimal, predictable datasets for reliable test execution. Keep these separate to prevent cross-environment contamination.

seeds/
├── development/
   ├── 01_users.yml
   ├── 02_products.yml
   └── 03_orders.yml
├── test/
   ├── minimal_users.yml
   └── test_scenarios.yml
└── shared/
    └── lookup_tables.yml

Make seeds incremental

Related is the idea of implementing an incremental seeding system rather than maintaining one monolithic seed file. This approach mirrors database migrations, where each seed file corresponds to specific schema versions or feature additions:

# 20240128_add_user_roles.rb
class UserRoleSeeder
  def self.seed
    return if Role.exists?

    Role.create!([
      { name: 'admin', permissions: ['all'] },
      { name: 'editor', permissions: ['create', 'update'] },
      { name: 'viewer', permissions: ['read'] }
    ])
  end
end

Make seeds idempotent 

The golden rule of seed files: running them twice should be the same as running them once. This means checking if data exists before creating it, having clean reset procedures, and handling edge cases gracefully. For example:

def seed_admin_role
  return if Role.exists?(name: 'admin')
  Role.create!(
    name: 'admin',
    permissions: ['all']
  )
end

Why Seed Files Cause Problems

For all their usefulness, seed files can become a significant source of friction in development. It’s a classic case of “simple until it isn’t.” Performance degradation occurs as seed data volumes grow. The initial database setup time increases, impacting developer productivity and CI pipeline execution times. 

This is caused by three core problems:

  1. Maintenance overhead is significant because seed files must stay synchronized with schema changes. Each modification to the database structure–adding columns, changing relationships, updating constraints–requires corresponding updates to seed data. When these updates are missed, it causes cascading failures across environments: test suites break, CI pipelines fail, and development environments become unusable.
  2. Environmental inconsistency occurs when seed data diverges between environments. Development, testing, and staging environments often require different data sets, leading to environment-specific bugs that are difficult to reproduce. This is especially problematic when trying to debug issues that only manifest with specific data patterns or volumes.
  3. Referential integrity management becomes increasingly complex with interconnected data models. Seed files must create records in the correct order to satisfy foreign key constraints and other database relationships. This complexity multiplies with each new model relationship, making seed file maintenance increasingly difficult.

These technical limitations don’t invalidate seed files’ utility but require careful consideration in database design and development workflows. Understanding these constraints helps inform decisions about seed file architecture and whether alternative approaches might be more appropriate for specific use cases.

One alternative is factory bots. Factory bots provide a programmatic alternative to static seed files, defining templates for generating test data dynamically. 

A factory defines a blueprint for object creation, handling both attributes and associations:

FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "user#{n}@example.com" }
    name { "Test User" }
    
    factory :admin do
      after(:create) do |user|
        user.roles << create(:role, name: 'admin')
      end
    end

    trait :with_orders do
      after(:create) do |user|
        create_list(:order, 3, user: user)
      end
    end
  end

  factory :order do
    user
    total { rand(100..1000) }
    status { ['pending', 'completed', 'cancelled'].sample }
  end
end

Factories generate unique data for each instance using sequences and dynamic attributes, reducing data collisions in test environments. They also allow for randomization where appropriate while maintaining deterministic values where needed.

Factories address the core problems of seed files:

  1. Maintenance. Factory definitions live alongside model code. Changes to models automatically surface incompatibilities with factories, and CI can verify factory validity without creating complete data.
  2. Environmental Consistency. The same factory definitions work across all environments with data variations handled through traits and parameters rather than environment-specific files.
  3. Performance. Objects are created on-demand rather than bulk-loaded, and you have selective creation of only necessary objects and associations.

However, factories aren’t a complete replacement for seed files. They excel in test environments but may not be suitable for production data initialization, demo environment setup where consistent, curated data is needed, or complex data scenarios that require specific, predefined relationships.

So, what’s a DBA to do?

Use Database Branching

Database branching allows you to create isolated copies of your database that share the same underlying storage but can evolve independently. Instead of version-controlling your seed files, you control your entire database.

Database branching isn’t one-size-fits-all. You’ve got two main flavors: branches with data and schema-only branches. Each serves different purposes and comes with its own trade-offs.

Data & Schema Branches

When you create a branch with data and schema, you’re getting an exact copy of your database. This includes all your tables, relationships, indexes, and—crucially—the data itself.

Think of it like cloning your production environment. Everything comes along for the ride: user records, transaction history, product catalogs—whatever lives in your database. Since it’s a copy-on-write clone, creating these branches is lightning-fast, even with terabytes of data.

This approach shines when you need to:

  • Debug issues that only show up with real data patterns
  • Test query performance with production-like data volumes
  • Verify how your changes will behave in the real world

This is the ideal option when real data patterns are crucial and performance testing matters.

Schema-Only Branches

Sometimes, you don’t want all that production data. Maybe you’re dealing with sensitive information or need a clean slate for testing. That’s where schema-only branches come in.

With a schema-only branch, you get the structure but no data. It’s like getting an empty house with all the rooms laid out—you’ll need to furnish it yourself. This is where seeding comes into play.

After creating a schema-only branch, you have options for populating it:

  • Run your seed scripts
  • Generate synthetic test data
  • Load sanitized production data
  • Use your test fixtures

This is the option for when you’re working with sensitive data (though we are working on data anonymization), need pristine test environments, and/or want controlled test scenarios.

Neon is leading the way in data and schema branching, with schema-only branches on the horizon. While other providers offer similar features, Neon’s branching-based approach sets a new standard by removing the challenges of traditional seeding and enhancing the developer experience for you and your team.

Ready to dive in? Start for free with Neon today.